Object System

Jason Cairns


Distributed data needs some form of representation to be tangible and useful to the client. While many means of representation may exist, the standard mechanism is to use an object system to represent any complex data. The benefits of an object system over less structured data are a well-explored topic in object-oriented programming, and include capabilities such as polymorphism and encapsulation, which are extremely useful for data as precarious as distributed data.

With the roots of the language in Scheme, composing an object system for R is a trivial matter. This fact has been taken to its utmost conclusion, with multiple object systems shipping with the base product, and these in turn dwarfed in number by the object systems provided externally through packages. The simplest system offered by R is known as S3. S3 consists of tagging objects with a class attribute, combined with generic functions that dispatch based on methods defined for the class.
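The S3 mechanism can be sketched in a few lines. The class name `dist_vec` and the generic below are hypothetical illustrations, not part of any existing package:

```r
# A minimal illustration of S3: an object is tagged with a class
# attribute, and a generic function dispatches on that class.
# "dist_vec" is a hypothetical class name for this sketch.
x <- structure(list(chunks = list(1:3, 4:6)), class = "dist_vec")

# A generic function; dispatch is performed by UseMethod()
num_chunks <- function(obj) UseMethod("num_chunks")

# Method for the dist_vec class
num_chunks.dist_vec <- function(obj) length(obj$chunks)

# Fallback for any other class
num_chunks.default <- function(obj) stop("not a distributed object")

num_chunks(x)  # dispatches to num_chunks.dist_vec; returns 2
```

The entire system rests on nothing more than the `class` attribute and `UseMethod()`, which is precisely why S3 is so readily repurposed for new kinds of objects.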

The challenge arises of whether to represent distributed data as objects, and how to do so.

The technical term “distributed object” is relevant in this case, though it refers to a very specific form of objectification of distributed data, not just any object. Specifically, distributed objects are a means of access to objects on a distributed system. They typically take the form of a reference (stub) that acts as a transparent handle to fragmented referents (skeletons) over a distributed system. Details of their methods of interaction can vary enormously; distributed objects can exist anywhere on the spectrum of lazy/eager evaluation, for example. Greater transparency in distributed objects is best exemplified in R by pbdDMAT, which provides distributed matrix objects and implements nearly all standard matrix methods on them.
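The stub/skeleton relationship can be sketched as follows. All names are hypothetical, and the remote nodes are simulated here by local environments; a real system would forward each request over the network:

```r
# The stub is a lightweight local reference; the "skeletons" are
# simulated by local environments standing in for remote nodes.
new_stub <- function(chunk_values) {
  skeletons <- lapply(chunk_values, function(v) {
    node <- new.env()
    node$data <- v
    node
  })
  structure(list(skeletons = skeletons), class = "stub")
}

# The client calls sum() on the stub as if it were local; the request
# is forwarded to each skeleton and the partial results are combined.
sum.stub <- function(x, ...) {
  partials <- vapply(x$skeletons, function(node) sum(node$data), numeric(1))
  sum(partials)
}

s <- new_stub(list(1:5, 6:10))
sum(s)  # forwarded transparently; equals sum(1:10), i.e. 55
```

Because `sum()` is an internal generic in R, the method for the `stub` class is dispatched without the caller needing to know the data is fragmented; this is the transparency referred to above.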

A prototype implementation of distributed objects, focussing on vectors, can be found in the appendix at R/experiment-eager-dist-obj.R.

The benefits of distributed objects grow commensurately with their degree of transparency. In the state closest to the ideal, a distributed object would be manipulated in exactly the same manner as its local counterpart.

Experience has shown that complete transparency is impossible to achieve; ultimately, a distributed object is an abstraction, replete with the leaks inherent in such a physically dependent abstraction. This was noted with respect to pbdR in the previous chapter.

There also exists strong skepticism toward distributed objects among some commentators [1], [2], with Martin Fowler declaring his First Law of Distributed Object Design:

don’t distribute your objects [3]

In spite of such criticism, the notion remains widely used, with plenty of examples of effective real-world usage. The case against distributed objects rests on their inherent abstraction leaks and the unreliability they introduce. However, with cohesive data distributed continuously over multiple nodes, a reference is required in order to operate over the data. This reference, hosting complex data, is best maintained as part of an object system. In this way, distributed objects become inevitable. The remaining question is how to minimise the abstraction leaks, and how to build such a system for distributed objects.

Firstly, an appropriate level of abstraction is essential for distributed data. This requires an understanding of the audience for the framework. The problem statement names statisticians as the key audience, implying a high level of abstraction. A statistician's primary concern is modelling, rather than the precise details of optimisation and the like that may concern a computer scientist. As such, the details of distributed objects need to be as opaque as possible. However, the development of such a system entails an internal API for ease of maintenance, and this yields a low-level API as a spinoff that incidentally serves the useful purpose of allowing developers finer-grained control over objects. Concretely, the dividing line lies in the addressing of distributed chunks of data. A statistician is not primarily concerned with such details; the concern is with the data as a whole, not with subsets circumscribed by infrastructure limitations. This leads to a layer in which the details of chunk addresses are abstracted over, with the API providing a means of access and manipulation of the data as a single distributed object. Development of this layer, and lower-level computation more generally, requires direct addressing of chunks, including requests for individual operations on single chunks. This forms a lower-level “developer layer”.
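The two layers can be sketched as follows. All names are hypothetical, and the remote chunks are simulated as elements of a local list; the point is only the separation of concerns, with the user-facing function built atop the chunk-addressed developer layer:

```r
## Developer layer: operates on explicitly addressed chunks
chunk_apply <- function(obj, chunk_id, f) f(obj$chunks[[chunk_id]])

## User layer: abstracts over chunk addressing entirely
dist_mean <- function(obj) {
  sums <- vapply(seq_along(obj$chunks),
                 function(i) chunk_apply(obj, i, sum), numeric(1))
  lens <- vapply(seq_along(obj$chunks),
                 function(i) chunk_apply(obj, i, length), numeric(1))
  sum(sums) / sum(lens)
}

d <- list(chunks = list(c(1, 2, 3), c(4, 5)))
dist_mean(d)  # the statistician sees only the whole: mean of 1..5 = 3
```

A developer needing finer control calls `chunk_apply()` with an explicit chunk address; a statistician calls `dist_mean()` and never encounters chunk identifiers at all.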

The structure of distributed objects is therefore defined in terms of a bottom-up abstraction: a distributed object is a collection of chunks. This has the potential disadvantage of curtailing optimisations that might be possible were distributed objects defined in their own right.

Distributed data commonly breaks the abstraction whenever data must be transferred to or from the client node. The cost of transfer is typically so great that, were transfers completely opaque, modelling would be slowed to the point of hindrance. As such, it is common to specify transfers to and from the client node manually, even at the highest level.
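Explicit transfer typically surfaces as a deliberate round trip in user code. The `distribute()` and `collect()` functions below are hypothetical, with the distribution simulated locally by splitting a vector:

```r
# A sketch of explicit transfer: movement of data to and from the
# client is a deliberate, visible operation, never implicit.
distribute <- function(x, nchunks = 2) {
  idx <- sort(rep_len(seq_len(nchunks), length(x)))
  structure(list(chunks = split(x, idx)), class = "dist_vec")
}

collect <- function(obj) unlist(obj$chunks, use.names = FALSE)

d <- distribute(1:10, nchunks = 3)  # explicit transfer out
collect(d)                          # explicit transfer back: 1:10
```

Keeping both ends of the transfer explicit in this way is the one place where even a high-level API deliberately exposes the distributed nature of the data, precisely because the cost is too great to hide.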

The main system of objects may place greater emphasis on the storage of data, or on the methods defined for such data. This can be demonstrated through the three central operations on distributed data from the client node: distribution, running routines on the distributed data, and collecting the results. These could be performed in myriad forms, and are entirely described by the object system. Consider sparklyr as an example: the mechanism of distribution is described manually, and the user is provided with a Spark DataFrame object as a local reference to the distributed data. This object entirely abstracts the details of the data distribution. Routines on the data are converted to SQL and stored as statements alongside the Spark DataFrame objects. They are not run immediately, but stored for later. This laziness aids efficiency, as several consecutive operations can be slower than one combined operation. The statements are then run upon a collection request, being performed in distributed fashion immediately prior to collection. Importantly, the runtime details may be treated as irrelevant to the client, as the structure of even non-distributed data remains opaque until it is accessed directly.
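The lazy pattern sparklyr follows can be sketched in base R. The names below are hypothetical and the "remote" data is simulated by a local vector; the essential point is that operations are accumulated alongside the reference and only executed upon collection:

```r
# A reference carrying the data (here local) plus a queue of
# deferred operations.
lazy_ref <- function(data) list(data = data, ops = list())

add_op <- function(ref, f) {
  ref$ops <- c(ref$ops, f)
  ref  # nothing is computed yet
}

collect_ref <- function(ref) {
  # all deferred operations run only now, immediately before return
  Reduce(function(d, f) f(d), ref$ops, init = ref$data)
}

r <- lazy_ref(1:10)
r <- add_op(r, function(d) d[d %% 2 == 0])  # filter: keep evens
r <- add_op(r, function(d) d * 10)          # transform
collect_ref(r)  # returns c(20, 40, 60, 80, 100)
```

In sparklyr the queued operations are SQL statements executed by Spark rather than R closures, but the client-side shape is the same: a cheap local reference accrues work, and the expensive distributed execution is deferred until a collection is requested.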

To the client, this object system places greater emphasis on the procedure than on the data. Furthermore, information on the procedures is stored alongside the object, and this information is used to determine operations on the object.

The example of sparklyr addresses several challenges; notably, how to distribute, how to collect, and which operations to perform.

Part of the problem is made implicitly easier in that the only distributed objects under consideration are represented and manipulated as data frames. The challenge remains of how to properly collect distributed objects referencing arbitrary underlying classes. For example, when collecting chunks of some class, the means of combination on the client may vary with the class. Flexible use of the object system to enable arbitrary classes is therefore a key consideration for largescaler.
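S3 dispatch offers a natural route to class-dependent combination. The `combine()` generic below is a hypothetical sketch, choosing how collected chunks are reassembled on the client based on the class of the chunks:

```r
# Dispatch on the class of the first chunk to select the means of
# combination; "combine" is a hypothetical generic for this sketch.
combine <- function(chunks) UseMethod("combine", chunks[[1]])

# Vectors are concatenated
combine.numeric <- function(chunks) do.call(c, chunks)

# Data frames are stacked row-wise
combine.data.frame <- function(chunks) do.call(rbind, chunks)

combine(list(c(1, 2), c(3, 4)))                      # c(1, 2, 3, 4)
combine(list(data.frame(x = 1), data.frame(x = 2)))  # two-row data frame
```

Under this design, supporting a new underlying class requires only a new `combine` method, leaving the rest of the collection machinery untouched.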

Thus, the use of the object system to enable distributed operations, along with arbitrary underlying classes, is a major definitive challenge.

J. Waldo, G. Wyant, A. Wollrath, and S. Kendall, “A note on distributed computing,” in International workshop on mobile object systems, 1996, pp. 49–64.
A. Rotem-Gal-Oz, “Fallacies of distributed computing explained,” URL http://www.rgoarchitects.com/Files/fallacies.pdf, vol. 20, 2006.
M. Fowler, Patterns of enterprise application architecture. Boston: Addison-Wesley, 2003.