Home

Mutability & Garbage Collection

Jason Cairns

2022-12-03

largescaler:
- Chunk operations generate new references, may be multiple references to single mutable object.
- Reference GC -> eventual chunk GC
- Generation DAG stored with ChunkReference, can clear graph when proven computed (emerge); references deleted using R’s
GC.
Generation DAG
- Iteration OK when emerging, other forms include distributed Reduce etc.
- How to rely on gc when references so small
- mutability-emulation.md
- [Considerations on a Functional Iterative Interface](consideration-functional-iteration-interface.html

In the world of distributed objects in a dynamic language, both modification and deletion are first hinted at in relevence by the nature of chunk references. Before their consideration, several questions must be raised regarding chunk references: how to maintain a constant reference to chunks? Can a chunk have multiple references? Can multiple chunks share a reference? How to claim that a chunk, once modified, can maintain the same reference? And what to do when a chunk no longer has a reference? Such questions lead to the topics of mutability and garbage collection.

In discussion of memory usage, object mutability becomes highly relevant, and in the face of development for a distributed system, relevant often in surprising aspects. Mutability of objects refers to the ability to change an objects state; immutable objects are unable to be modified post-creation. Absent distributed data, in the standard case, both have implications for memory. An immutable object may have multiple references to it, each reference safe from accidental state changes by another reference, through the guarantees provided by immutability. This means that memory can be saved effectively through lacking a need to copy some objects. Mutable objects have benefits for memory in another manner, however, in that copies don’t have to be made if state needs to be changed. This capacity for changing state in an object does come at the cost of additional complexity, which has to be effectively managed. Different algorithms call for different mutability in objects, either paradigm having its best use, depending on what the situation demands.

Modification of objects sits adjacently as a concept to deletion of objects. The manner of dealing with deleted objects through using garbage collection is also worth discussion. Effectively all interactive programming languages make use of automatic memory management, freeing the programmer from direct concern over memory leaks. Garbage collection is the standard form of automatic memory management, and is the means that R takes, so shall serve to direct the discussion. In spite of helping to avoid memory bugs, garbage collection does have several drawbacks, most notably that of the overhead required to calculate which objects or regions of memory to reclaim, as well as the potentially unpredictable nature of when a garbage collection occurs. There are several strategies by which garbage collection may take place, though all rely upon determing whether objects may still be referenced or reached by other objects; when unreachable or unreferenced, the object may be collected.

In answer of the first few questions posed in this section, the mapping of references to chunks may theoretically take many different forms. For the sake of simplicity, nearly all distributed systems maintain a one-to-one mapping. The benefits of mapping multiple references to the same chunk are conceivable, particularly where memory is concerned, however would have to be carefully approached were the underlying data mutable, lest state grow wildly out of control. The benefits of mapping multiple chunks to the same reference are far from clear, and it becomes more obvious why it is not commonly done, the more it is considered.

Modification of a chunk doesn’t necessarily entail a change in the reference, in the same way that modification of a mutable object doesn’t necessarily entail a change in the identifier. However, systems such as Sparklyr with it’s RDD API, have all data as immutable, so any modifications to data require the creation of new objects to represent the change in state, and thereby an entirely new reference. Such a constraint massively simplifies the direction of an approach, yet restricts any possible advantage that could be gained through mutable objects.

Iteration is one exemplary workload where such an approach may be tested in usefulness. In a loop where an object is continually being updated and modified, a mutable data structure handles this easily. Immutable data quickly balloons the necessary memory required, unless memory is tightly managed. With dynamic languages being the focus of discussion, this is equivalent to garbage collection. Herein lies the greatest defect of most of these systems: a lack of capacity for iteration, which is due to a lack of mutability or garbage collection. Without iteration, arbitrary statistical models are also out of reach.

Returning to Sparklyr, the lack of simple persistence and deletion of objects serves as a significant shortcoming in this respect. Users must manually persist() objects, as well as delete. Garbage collection sits on top of the Scala, in turn Java, garbage collection. Were the system generalised to include mutable underlying datatypes, a new conception of persistence and deletion would be required.