Discussion on the Chunk Building Block

Jason Cairns


1 Introduction

Over the course of extensive prototyping, testing, and research towards the construction of an R-based distributed system for large-scale statistical modelling, the “chunk” concept has been imperative to intuitive and clean usage of the system. This document outlines the “chunk”, including underlying theory and implementation suggestions.

The term “chunk” is used here to refer to discrete portions of data that may be distributed across computing nodes. It has its concrete realisation in object-oriented terms as some instance implementing the interface of a “chunk” type, with further details in sec. 2.

The provenance of the concept can be found in many programs operating with distributed partitioning schemes, including MySQL Cluster and the Hadoop Distributed File System [1], [2].

2 Interface

The chunk as a type has at minimum two operations for its interface: data(), and do(), as shown in tbl. 1. These correspond to access and execution on an abstract chunk, where data() returns the underlying data encapsulated by the chunk, and do() takes a function and a variable number of chunk arguments, returning a chunk representing the result. do() and data() are intimately cohered, in that the data() function must be called at some point to access the underlying data for the actual calling of whatever function is given to the do() function, and the result of the do() operation can be accessed only through data().

Table 1: Interface for the chunk type
Responsibility Operation Return Type
Data access data(chunk) <data>
Execution on Data do(function, ...chunk) <chunk>

3 Suggestions for Implementation

The implementation of such an interface strongly depends on the fact that the data underlying a chunk may be in one of several different states.

Most notably, an instance of a chunk may be returned by do(), whose underlying data may still be computed either concurrently, or at some point in the future; the limitation of present data availability has purposely not been placed, in order that concurrent operation scheduling may be dynamic. With this in mind, the chunk will adopt different behaviours internally depending on the status of the data; for example, data that is still being computed will not allow immediate access via data(), and may require communications set up in order to be transferred, while fixed and pre-computed data may be immediately accessible or even cached locally.

A simple subclassed form of this particular implementation is given in fig. 1.

Figure 1: Subclassed implementation of the chunk interface

This meets the Liskov substitution principle[3], however doesn’t retain flexibility for other forms of state; here the whole of the chunk object’s state is not satisfactorily captured entirely by the chunk – there may be other aspects of state, such as a command history, last cache, or others, that would demand subclasses for consistency, yet lead to a multiplicative proliferation of subclasses if implemented.

Therefore, an alternative is presented, wherein the data state is delegated and thereby encapsulated, as given in fig. 2. Assuming there is some set of mutually exclusive state spaces S, each containing some set of states NS, the growth of required subclasses to capture interacting state is given by 𝒪(∑SNS)!, whereas delegation leads to growth proportional only to the number of state spaces, at 𝒪(S).

Figure 2: Delegated implementation of the chunk interface

An example of additional states to motivate the need for multiple forms of state-capture is given in the following example: In the service of data resiliance in the face of likely machine failure, the underlying data on the machine can be written to disk and restored following failure. Assuming immutability of the data, once it has been written, it has no need to be re-written. The only data that requires writing is that resulting from operations on the original data. Imagining a log is held of data operations, with one datum resulting in another through some operation, long chains of parent-child data may be created. Depending on the granularity, portions of these chains may heuristically be skipped for writing to disk, in order to save time, where minimisation of disk write time is to be balanced by the probability of machine failure over the time spent performing operations. The maintenance of a command history and current state of storage of data, may be captured in the chunk by another form of state, which can separately trigger further disk writes. This is depicted in fig. 3

Figure 3: Motivation for delegation in the implementation of chunk interface with chunk history being saved as a state

With respect to all of the class descriptions, this remains only a general description of the architecture; non class-based approaches that still follow a similar pattern of delegation may very well be preferable, especially given that the language of choice for implementation is , which possesses many dynamic features that can substitute for these state object compositions, such as closures as part of higher-order functions.

4 Layering over the Chunk: An Interface for Chunk Aggregation

The value of distributing objects across machines finds greater value when an object larger than the memory limits of any remote machine can be split up into multiple chunks that each hold some constituent portion of the underlying data. The notion of a composite chunk is therefore introduced, in order to capture the idea of an object being composed of multiple chunks. This is equivalent in many respects to the distributed object as used in earlier prototypes, but serves a far more general role.

A composite chunk is a special type of chunk that possesses an ordered list of children, and can therefore be recursively composed in a tree structure. Methods for combination of children, which either overrides the data() method or is delegated thereto, would be required. It is conceivable that the same could be required of the do() method, where different representations of chunk composition would alter standard requested functions. Example variations of composite chunks include row-block matrices, column-block matrices, ordered lists, unordered lists (sets), etc. The implementation of a new composite chunk can involve the creation of a new subclass of composite chunk, and perhaps some subclass variation of an atomic chunk.

The flexibility granted by such variations in composition allow for greater matching of data structures to parallel statistical procedures. The simplest example is given by blockwise parallel matrix multiplication, but more advanced statistical procedures possess various parallel data structure mappings. For instance, it has been shown that least squares can be made very efficiently parallel by decomposing the XTX matrix by columns into disjoint blocks [4]. This can be contrasted with the row-wise splitting that has found success in GLMs and GAMS as special cases of non-linear optimisation [5]. Multi-chain MCMC methods can also benefit from parallelisation, with each chain being run in parallel, though no special ordering need be defined on each chain [6]. A variety of other parallel and distributed statistical procedures with mappings to composite chunk structures have further derivations, most being well-served by variations on the composite chunk[7].

Chunk aggregation through composite chunks exists as an interaction layer above raw singular chunk manipulation. Type safety is added through supplemental child manipulation methods, though this is exists in tension with the aim of a uniform interface, and whether the child manipulation methods exist in the compositeChunk class or the abstract chunk class determines the balance of favour. An example implementation of a simple family of composite chunks is given in fig. 4. Conceivably, further layers may be added to hide details such as the blocking format of matrices, allowing clients to only be concerned with matrices qua matrices.

Figure 4: Implementation of a simple composite chunk family, including row-blocked matrices, column-blocked matrices, ordered lists, sets, as well as forms of atomic chunks that may have special methods for differentially blocked matrices, including column and row chunks
K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in 2010 IEEE 26th symposium on mass storage systems and technologies (MSST), 2010, pp. 1–10.
M. Kruckenberg and J. Pipes, “Cluster,” in Pro MySQL, Apress, 2005, pp. 617–644. doi: 10.1007/978-1-4302-0048-2_19.
B. Liskov, “Keynote address - data abstraction and hierarchy,” ACM SIGPLAN Notices, vol. 23, no. 5, pp. 17–34, May 1988, doi: 10.1145/62139.62141.
R. A. Renaut, “A parallel multisplitting solution of the least squares problem,” Numerical linear algebra with applications, vol. 5, no. 1, pp. 11–31, 1998.
N. R. Suri, D. Deodhare, and P. Nagabhushan, “Parallel levenberg-marquardt-based neural network training on linux clusters-a case study.” in ICVGIP, 2002, pp. 1–6.
I. Strid, “Efficient parallelisation of metropolishastings algorithms using a prefetching approach,” Computational Statistics & Data Analysis, vol. 54, no. 11, pp. 2814–2835, Nov. 2010, doi: 10.1016/j.csda.2009.11.019.
G. Guo, “Parallel statistical computing for statistical inference,” Journal of Statistical Theory and Practice, vol. 6, no. 3, pp. 536–565, Sep. 2012, doi: 10.1080/15598608.2012.695705.