2021-08-10

Over the course of extensive prototyping, testing, and research towards the construction of an R-based distributed system for large-scale statistical modelling, the “chunk” concept has proven essential to intuitive and clean usage of the system. This document outlines the “chunk”, including underlying theory and implementation suggestions.

The term “chunk” is used here to refer to discrete portions of data that may be distributed across computing nodes. It has its concrete realisation in object-oriented terms as some instance implementing the interface of a “chunk” type, with further details in sec. 2.

The provenance of the concept can be found in many programs operating with distributed partitioning schemes, including MySQL Cluster and the Hadoop Distributed File System [1], [2].

The chunk as a type has at minimum two operations in its interface: `data()` and `do()`, as shown in tbl. 1. These correspond to access and execution on an abstract chunk, where `data()` returns the underlying data encapsulated by the chunk, and `do()` takes a function and a variable number of chunk arguments, returning a chunk representing the result. `do()` and `data()` are closely coupled: `data()` must be called at some point to access the underlying data for the actual application of whatever function is given to `do()`, and the result of the `do()` operation can be accessed only through `data()`.

| Responsibility | Operation | Return Type |
|---|---|---|
| Data access | `data(chunk)` | `<data>` |
| Execution on Data | `do(function, ...chunk)` | `<chunk>` |
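The interface of tbl. 1 can be sketched as follows. The sketch is in Python for illustration (the target implementation language is R), and the `Chunk` and `LocalChunk` names, along with the trivial single-process realisation, are purely hypothetical:

```python
from abc import ABC, abstractmethod

class Chunk(ABC):
    """Abstract chunk: a discrete, possibly remote, portion of data."""

    @abstractmethod
    def data(self):
        """Return the underlying data encapsulated by the chunk."""

    @abstractmethod
    def do(self, fn, *chunks):
        """Apply fn to this chunk (and any further chunk arguments),
        returning a new chunk representing the result."""

class LocalChunk(Chunk):
    """Trivial single-process realisation, for illustration only."""

    def __init__(self, value):
        self._value = value

    def data(self):
        return self._value

    def do(self, fn, *chunks):
        # data() is ultimately called to feed the underlying values to fn,
        # and the result is itself only accessible through data()
        args = [self.data()] + [c.data() for c in chunks]
        return LocalChunk(fn(*args))
```

For example, `LocalChunk([1, 2, 3]).do(lambda a, b: [i + j for i, j in zip(a, b)], LocalChunk([4, 5, 6])).data()` yields `[5, 7, 9]`, with both the application and the result access mediated entirely by the two-operation interface.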

The implementation of such an interface depends strongly on the fact that the data underlying a chunk may be in one of several different states. Most notably, an instance of a `chunk` may be returned by `do()` while its underlying data is still being computed, either concurrently or at some point in the future; no requirement of immediate data availability has been imposed, purposely, so that concurrent operation scheduling may remain dynamic. With this in mind, the chunk will adopt different behaviours internally depending on the status of its data; for example, data that is still being computed will not allow immediate access via `data()`, and may require communications to be set up in order to be transferred, while fixed and pre-computed data may be immediately accessible or even cached locally.
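The distinction between still-computing and already-available data can be made concrete with a future-backed chunk, sketched below in Python; the `FutureChunk` name and the thread-pool mechanism are illustrative assumptions, not part of the proposed design:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

class FutureChunk:
    """Chunk whose underlying data may still be being computed;
    data() blocks until the result is available."""

    def __init__(self, future):
        self._future = future

    @classmethod
    def of(cls, value):
        # pre-computed data: immediately available
        return cls(_pool.submit(lambda: value))

    def ready(self):
        return self._future.done()

    def data(self):
        # blocks while the underlying data is still being computed
        return self._future.result()

    def do(self, fn, *chunks):
        # schedule without waiting: the returned chunk represents data
        # computed concurrently, or at some point in the future
        futures = [self._future] + [c._future for c in chunks]
        return FutureChunk(
            _pool.submit(lambda: fn(*[f.result() for f in futures])))
```

Here `do()` returns immediately regardless of whether its inputs have resolved, which is precisely the dynamic-scheduling property described above; only `data()` forces a wait.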

A simple subclassed form of this particular implementation is given in fig. 1.

This satisfies the Liskov substitution principle [3], but does not retain flexibility for other forms of state: the state of a chunk object is not captured by data availability alone. There may be further aspects of state, such as a command history or the last cache, that would each demand their own subclasses for consistency, leading to a multiplicative proliferation of subclasses if implemented this way.

Therefore, an alternative is presented, wherein the data state is delegated and thereby encapsulated, as given in fig. 2. Assuming there is some set of mutually exclusive state spaces *S*, each space containing some number of states *N*_{S}, the growth in subclasses required to capture interacting state is given by 𝒪(∏_{S}*N*_{S}), whereas delegation leads to growth proportional only to the number of state spaces, at 𝒪(|*S*|).
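The difference in growth can be checked with a short calculation; the three state spaces below (availability, storage, caching) are hypothetical examples, not a fixed part of the design:

```python
import math

# Hypothetical mutually exclusive state spaces and their sizes:
# availability {pending, ready, failed}, storage {memory, disk},
# caching {cached, uncached}
space_sizes = [3, 2, 2]

# Capturing every interacting combination of states in its own subclass
# requires a class per element of the Cartesian product of the spaces...
subclasses_needed = math.prod(space_sizes)

# ...whereas delegating each space to its own encapsulated state object
# requires only one delegate per space.
delegates_needed = len(space_sizes)

print(subclasses_needed, delegates_needed)  # 12 vs. 3
```

Adding a fourth two-valued state space would double the subclass count but add only a single delegate, which is the essence of the argument for delegation.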

An example of additional state, motivating the need for multiple forms of state-capture, is the following: in the service of data resilience in the face of likely machine failure, the data underlying a chunk on a machine can be written to disk and restored following failure. Assuming immutability of the data, once it has been written it need never be re-written; the only data that requires writing is that resulting from operations on the original data. If a log is held of data operations, with one datum resulting in another through some operation, long chains of parent-child data may be created. Depending on the granularity, portions of these chains may heuristically be skipped when writing to disk in order to save time, where minimisation of disk-write time is to be balanced against the probability of machine failure over the time spent performing operations. The maintenance of a command history, and of the current storage state of the data, may be captured in the chunk by another form of state, which can separately trigger further disk writes. This is depicted in fig. 3.
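One possible form of the write-skipping heuristic is sketched below in Python; the fixed-spacing rule and the function name are assumptions for illustration, with the real spacing presumably tuned against the probability of machine failure:

```python
def entries_to_persist(chain_length, granularity):
    """Choose which entries of a parent-child chain of derived data to
    write to disk: every `granularity`-th entry plus the final one, so
    that recovery after failure never replays more than `granularity`
    operations from the last persisted ancestor."""
    keep = set(range(0, chain_length, granularity))
    keep.add(chain_length - 1)  # always persist the newest datum
    return sorted(keep)
```

For a chain of ten derived data with a granularity of three, this persists indices `[0, 3, 6, 9]`, skipping the intermediate entries whose recomputation from a persisted ancestor is judged cheaper than writing them to disk.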

With respect to all of the class descriptions, this remains only a general description of the architecture; non-class-based approaches that follow a similar pattern of delegation may well be preferable, especially given that the language of choice for implementation is R, which possesses many dynamic features that can substitute for these state-object compositions, such as closures as part of higher-order functions.
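As an indication of how closures can substitute for state objects, the following Python sketch keeps all chunk state in an enclosing environment, much as an R closure would; the `make_chunk` interface is hypothetical:

```python
def make_chunk(value):
    """Closure-based chunk: the state lives in the enclosing scope
    rather than in composed state objects."""
    state = {"value": value, "reads": 0}  # could also hold e.g. a command history

    def data():
        state["reads"] += 1
        return state["value"]

    def do(fn, *others):
        # each argument chunk's data() is called to feed fn
        return make_chunk(fn(state["value"], *[o["data"]() for o in others]))

    return {"data": data, "do": do}
```

Any new aspect of state becomes one more entry in the enclosing environment rather than a new class, which is the substitution the paragraph above alludes to.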

Distributing objects across machines finds its greatest value when an object larger than the memory limits of any single remote machine can be split into multiple chunks that each hold some constituent portion of the underlying data. The notion of a *composite chunk* is therefore introduced, in order to capture the idea of an object being composed of multiple chunks. This is equivalent in many respects to the *distributed object* as used in earlier prototypes, but serves a far more general role.

A composite chunk is a special type of chunk that possesses an ordered list of children, and can therefore be recursively composed in a tree structure. Methods for the combination of children, which either override the `data()` method or are delegated to by it, would be required. It is conceivable that the same could be required of the `do()` method, where different representations of chunk composition would alter standard requested functions. Example variations of composite chunks include row-block matrices, column-block matrices, ordered lists, unordered lists (sets), etc. The implementation of a new composite chunk can involve the creation of a new subclass of composite chunk, and perhaps some subclass variation of an atomic chunk.
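A minimal Python sketch of the recursive structure follows; the `combine` argument standing in for the combination method, and all names here, are illustrative assumptions:

```python
class AtomicChunk:
    """Leaf chunk holding data directly."""
    def __init__(self, value):
        self._value = value

    def data(self):
        return self._value

class CompositeChunk:
    """Chunk composed of an ordered list of child chunks; data() is
    delegated to a combination method over the children's data."""
    def __init__(self, children, combine):
        self.children = list(children)  # ordered; may themselves be composite
        self._combine = combine

    def data(self):
        return self._combine([c.data() for c in self.children])

# An ordered-list composite combines children by concatenation;
# a row-block matrix would instead stack row blocks, and so on.
concat = lambda xs: [v for x in xs for v in x]
ordered = CompositeChunk([AtomicChunk([1, 2]),
                          CompositeChunk([AtomicChunk([3])], concat)],
                         concat)
```

Because a `CompositeChunk` may contain further composites, the tree structure described above falls out of the definition directly; `ordered.data()` here recursively flattens to `[1, 2, 3]`.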

The flexibility granted by such variations in composition allows for closer matching of data structures to parallel statistical procedures. The simplest example is given by blockwise parallel matrix multiplication, but more advanced statistical procedures possess various parallel data-structure mappings. For instance, it has been shown that least squares can be made very efficiently parallel by decomposing the *X*^{T}*X* matrix by columns into disjoint blocks [4]. This can be contrasted with the row-wise splitting that has found success in GLMs and GAMs as special cases of non-linear optimisation [5]. Multi-chain MCMC methods can also benefit from parallelisation, with each chain being run in parallel, though no special ordering need be defined on each chain [6]. A variety of other parallel and distributed statistical procedures map to composite chunk structures, most being well-served by variations on the composite chunk [7].
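The blockwise matrix-multiplication example can be sketched directly; in the chunk setting each row-block product below would be a `do()` call on a separate chunk, though the plain-Python version here is only an illustration:

```python
def matmul(A, B):
    """Naive dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def blockwise_matmul(A, B, block_rows):
    """Row-block parallel product: each row block of A multiplies B
    independently (on its own chunk, in the distributed setting), and
    the partial results are stacked in their original order."""
    blocks = [A[i:i + block_rows] for i in range(0, len(A), block_rows)]
    return [row for blk in blocks for row in matmul(blk, B)]
```

The independence of the per-block products is what a row-block composite chunk captures structurally: the composite's combination method is simply the ordered stacking of the children's results.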

Chunk aggregation through composite chunks exists as an interaction layer above raw singular chunk manipulation. Type safety is added through supplemental child-manipulation methods, though this exists in tension with the aim of a uniform interface, and whether the child-manipulation methods belong to the `compositeChunk` class or the abstract `chunk` class determines the balance of favour. An example implementation of a simple family of composite chunks is given in fig. 4. Conceivably, further layers may be added to hide details such as the blocking format of matrices, allowing clients to be concerned only with matrices *qua* matrices.

[1] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in *2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)*, 2010, pp. 1–10.

[2] M. Kruckenberg and J. Pipes, “Cluster,” in *Pro MySQL*, Apress, 2005, pp. 617–644. doi: 10.1007/978-1-4302-0048-2_19.

[3] B. Liskov, “Keynote address - data abstraction and hierarchy,” *ACM SIGPLAN Notices*, vol. 23, no. 5, pp. 17–34, May 1988, doi: 10.1145/62139.62141.

[4] R. A. Renaut, “A parallel multisplitting solution of the least squares problem,” *Numerical Linear Algebra with Applications*, vol. 5, no. 1, pp. 11–31, 1998.

[5] N. R. Suri, D. Deodhare, and P. Nagabhushan, “Parallel Levenberg-Marquardt-based neural network training on Linux clusters - a case study,” in *ICVGIP*, 2002, pp. 1–6.

[6] I. Strid, “Efficient parallelisation of Metropolis-Hastings algorithms using a prefetching approach,” *Computational Statistics & Data Analysis*, vol. 54, no. 11, pp. 2814–2835, Nov. 2010, doi: 10.1016/j.csda.2009.11.019.

[7] G. Guo, “Parallel statistical computing for statistical inference,” *Journal of Statistical Theory and Practice*, vol. 6, no. 3, pp. 536–565, Sep. 2012, doi: 10.1080/15598608.2012.695705.