2021-08-10

Over the course of extensive prototyping, testing, and research towards the construction of an R-based distributed system for large-scale statistical modelling, the “chunk” concept has been imperative to intuitive and clean usage of the system. This document outlines the “chunk”, including underlying theory and implementation suggestions.

The term “chunk” is used here to refer to discrete portions of data that may be distributed across computing nodes. It has its concrete realisation in object-oriented terms as some instance implementing the interface of a “chunk” type, with further details in sec. 2.

The provenance of the concept can be found in many programs operating with distributed partitioning schemes, including MySQL Cluster and the Hadoop Distributed File System [1], [2].

The chunk as a type has at minimum two operations for its interface:
`data()`

, and
`do()`

, as shown
in tbl. 1. These correspond to access and
execution on an abstract chunk, where `data()`

returns the
underlying data encapsulated by the chunk, and `do()`

takes a
function and a variable number of chunk arguments, returning a chunk
representing the result. `do()`

and `data()`

are
intimately cohered, in that the `data()`

function must
be called at some point to access the underlying data for the actual
calling of whatever function is given to the `do()`

function, and
the result of the `do()`

operation can
be accessed only through `data()`

.

Responsibility | Operation | Return Type |
---|---|---|

Data access | `data(chunk)` |
`<data>` |

Execution on Data | `do(function, ...chunk)` |
`<chunk>` |

The implementation of such an interface strongly depends on the fact that the data underlying a chunk may be in one of several different states.

Most notably, an instance of a `chunk`

may be returned by
`do()`

, whose
underlying data may still be computed either concurrently, or at some
point in the future; the limitation of present data availability has
purposely not been placed, in order that concurrent operation scheduling
may be dynamic. With this in mind, the chunk will adopt different
behaviours internally depending on the status of the data; for example,
data that is still being computed will not allow immediate access via
`data()`

, and
may require communications set up in order to be transferred, while
fixed and pre-computed data may be immediately accessible or even cached
locally.

A simple subclassed form of this particular implementation is given in fig. 1.

This meets the Liskov substitution principle[3], however doesn’t retain flexibility for other forms of state; here the whole of the chunk object’s state is not satisfactorily captured entirely by the chunk – there may be other aspects of state, such as a command history, last cache, or others, that would demand subclasses for consistency, yet lead to a multiplicative proliferation of subclasses if implemented.

Therefore, an alternative is presented, wherein the data state is
delegated and thereby encapsulated, as given in fig. 2. Assuming there is some set of mutually
exclusive state spaces *S*, each
containing some set of states *N*_{S}, the growth
of required subclasses to capture interacting state is given by 𝒪(∑_{S}*N*_{S})!,
whereas delegation leads to growth proportional only to the number of
state spaces, at 𝒪(*S*).

An example of additional states to motivate the need for multiple forms of state-capture is given in the following example: In the service of data resiliance in the face of likely machine failure, the underlying data on the machine can be written to disk and restored following failure. Assuming immutability of the data, once it has been written, it has no need to be re-written. The only data that requires writing is that resulting from operations on the original data. Imagining a log is held of data operations, with one datum resulting in another through some operation, long chains of parent-child data may be created. Depending on the granularity, portions of these chains may heuristically be skipped for writing to disk, in order to save time, where minimisation of disk write time is to be balanced by the probability of machine failure over the time spent performing operations. The maintenance of a command history and current state of storage of data, may be captured in the chunk by another form of state, which can separately trigger further disk writes. This is depicted in fig. 3

With respect to all of the class descriptions, this remains only a general description of the architecture; non class-based approaches that still follow a similar pattern of delegation may very well be preferable, especially given that the language of choice for implementation is , which possesses many dynamic features that can substitute for these state object compositions, such as closures as part of higher-order functions.

The value of distributing objects across machines finds greater value
when an object larger than the memory limits of any remote machine can
be split up into multiple chunks that each hold some constituent portion
of the underlying data. The notion of a *composite chunk* is
therefore introduced, in order to capture the idea of an object being
composed of multiple chunks. This is equivalent in many respects to the
*distributed object* as used in earlier prototypes, but serves a
far more general role.

A composite chunk is a special type of chunk that possesses an
ordered list of children, and can therefore be recursively composed in a
tree structure. Methods for combination of children, which either
overrides the `data()`

method or is
delegated thereto, would be required. It is conceivable that the same
could be required of the `do()`

method, where
different representations of chunk composition would alter standard
requested functions. Example variations of composite chunks include
row-block matrices, column-block matrices, ordered lists, unordered
lists (sets), etc. The implementation of a new composite chunk can
involve the creation of a new subclass of composite chunk, and perhaps
some subclass variation of an atomic chunk.

The flexibility granted by such variations in composition allow for
greater matching of data structures to parallel statistical procedures.
The simplest example is given by blockwise parallel matrix
multiplication, but more advanced statistical procedures possess various
parallel data structure mappings. For instance, it has been shown that
least squares can be made very efficiently parallel by decomposing the
*X*^{T}*X*
matrix by columns into disjoint blocks [4]. This can be contrasted with the
row-wise splitting that has found success in GLMs and GAMS as special
cases of non-linear optimisation [5]. Multi-chain MCMC methods can also
benefit from parallelisation, with each chain being run in parallel,
though no special ordering need be defined on each chain [6]. A
variety of other parallel and distributed statistical procedures with
mappings to composite chunk structures have further derivations, most
being well-served by variations on the composite chunk[7].

Chunk aggregation through composite chunks exists as an interaction
layer above raw singular chunk manipulation. Type safety is added
through supplemental child manipulation methods, though this is exists
in tension with the aim of a uniform interface, and whether the child
manipulation methods exist in the `compositeChunk`

class or
the abstract `chunk`

class determines the balance of favour.
An example implementation of a simple family of composite chunks is
given in fig. 4. Conceivably, further
layers may be added to hide details such as the blocking format of
matrices, allowing clients to only be concerned with matrices
*qua* matrices.

[1]

K.
Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop
distributed file system,” in *2010 IEEE 26th symposium on mass
storage systems and technologies (MSST)*, 2010, pp. 1–10.

[2]

M.
Kruckenberg and J. Pipes, “Cluster,” in *Pro
MySQL*, Apress, 2005, pp. 617–644. doi: 10.1007/978-1-4302-0048-2_19.

[3]

B.
Liskov, “Keynote address - data abstraction and hierarchy,”
*ACM SIGPLAN Notices*, vol. 23, no. 5,
pp. 17–34, May 1988, doi: 10.1145/62139.62141.

[4]

R.
A. Renaut, “A parallel multisplitting solution of the least
squares problem,” *Numerical linear algebra with
applications*, vol. 5, no. 1, pp. 11–31, 1998.

[5]

N.
R. Suri, D. Deodhare, and P. Nagabhushan, “Parallel
levenberg-marquardt-based neural network training on linux clusters-a
case study.” in *ICVGIP*, 2002, pp. 1–6.

[6]

I.
Strid, “Efficient parallelisation of
metropolishastings algorithms using a prefetching
approach,” *Computational Statistics & Data
Analysis*, vol. 54, no. 11, pp. 2814–2835, Nov. 2010, doi: 10.1016/j.csda.2009.11.019.

[7]

G.
Guo, “Parallel statistical computing for statistical
inference,” *Journal of Statistical Theory and Practice*,
vol. 6, no. 3, pp. 536–565, Sep. 2012, doi: 10.1080/15598608.2012.695705.