Definitive Challenges

Jason Cairns


The development of the project is defined principally by the challenges faced, each introducing its attendant constraints. These defining challenges are in turn formed by the tension among the requirements, as well as by specific technical problems.

From the very outset, the problem statement held certain competing elements within its own predication. Most notable was the tradeoff between usability and functionality. While these two facets often form a symbiotic relationship, with increasing usability positively influencing functionality, in many cases one must be traded off against the other, as shown by some of the packages reviewed in the previous chapter. The problem statement was a bold response to many other systems that choose one factor at the total expense of the other. As an example, and to generalise, Sparklyr and SNOW are highly usable. This largely results from a simple model that can be retained by users without too much difficulty. However, they lack certain functionality that other packages provide alongside their concomitant complexity (and, equally concomitant, a lack of users); examples include Rmpi and pbdR.

A simple system model is not the only aspect that aids usability, and in most cases the reality is far more complex at the level of implementation details, as both Sparklyr and SNOW attest. The specific API provided by a package has a gatekeeping effect on usability: a good API seems to disappear when used, whereas a poor API renders the usability of a package null. Combined, a clear model and a quality API yield a high level of usability, and Sparklyr and SNOW both achieve this. Sparklyr takes advantage of the dplyr interface, which has been demonstrated as a quality API for many users. It adds only the minimal additional functions required to support the Spark model, with the majority of functions used by the developer being methods written for the Spark dataframe over the dplyr generics. SNOW delivers usability in its API by limiting the set of possible operations.
Rather than providing an operation for every potentiality desired by an end-user, SNOW limits its offering to a minimal spanning set of operations, with the user then composing the SNOW functions to form the connections between them. This is a step back in API ease-of-use relative to Sparklyr, but it matches the operating model better. At the other end of the scale, Rmpi and pbdR boast enormous APIs with exceptionally complex operating models, along with greater functionality, such as the development of arbitrary iterative models. Such is the context for the main competing elements inherent to the problem statement, elements to be carefully straddled by the largescaler framework.
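As an illustrative sketch of this minimal spanning set, the base-R parallel package, which incorporates the SNOW interface, can stand in for SNOW itself; the two-worker local cluster below is an assumption for illustration only.

```r
# A sketch of SNOW-style minimal operations, via the base-R 'parallel'
# package, which incorporates the SNOW interface. The two-worker local
# cluster is assumed purely for illustration.
library(parallel)

cl <- makeCluster(2)   # launch two local workers

# clusterApply: distribute elements over the nodes; one primitive
# from which more specific operations are composed by the user.
squares <- clusterApply(cl, 1:4, function(x) x^2)

# clusterCall: evaluate the same call on every node.
info <- clusterCall(cl, Sys.time)

stopCluster(cl)

unlist(squares)  # 1 4 9 16
```

The user composes these few primitives (apply, call, export) rather than drawing on a bespoke operation per use case; this is the limited operation set described above.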

The most obvious difference between the example packages is in the capacity to control individual node intercommunication. Sparklyr and SNOW have no means of individual node control: operations occur cluster-wide and via pre-defined means. Conversely, Rmpi and pbdR allow for node-specific operations. The challenge posed by the tension in the problem statement is therefore further clarified: to allow for a simple operational system model, as well as a clean interface, on top of functionality that extends to individual node control. No package reviewed in the preceding chapter marries these together successfully, with the failings elucidated element-wise.

These concepts also represent a fundamental difference in understanding between the developers of the packages, with two major camps reflecting functional and imperative approaches to programming problems. The camp referred to informally as “functional” places a greater emphasis within the API on telling the system what needs to be done. Conversely, the APIs from the “imperative” group focus on telling the system how to do whatever needs to be done. While the imperative approach may aid efficiency in many cases, for most jobs the end-user simply wants to deliver high-level instructions and have the system work out the details.
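The distinction between the two camps can be sketched in plain base R; the example below is purely illustrative.

```r
# A minimal sketch of the "what" versus "how" distinction, in base R.

# Functional style: state what is wanted; the mechanics of allocation
# and iteration are left to the system (here, vapply).
squares_functional <- vapply(1:5, function(x) x^2, numeric(1))

# Imperative style: spell out how the result is produced, step by step.
squares_imperative <- numeric(5)
for (i in 1:5) {
  squares_imperative[i] <- i^2
}

identical(squares_functional, squares_imperative)  # TRUE
```

The functional form generalises more readily to a distributed setting, where the system is free to evaluate each element on a remote node; the imperative form fixes the mechanics in user code.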

The tension between these two concepts grows fractally as the technical components of their solution are traversed. Such a traversal forms a significant piece of work within this project, and the numerous prototypes that were developed will be referenced as the definitive challenges that each overcame, and created, are described.

The challenges to be considered include:

  1. Object System
  2. Communication
  3. Concurrency
  4. Evaluation
  5. Mutability
  6. Garbage Collection

They will be discussed briefly here, then in more detail in the subsections below.

Overall, the central challenges considered are remarkably similar to those facing the design of computer languages. This is somewhat surprising, given that the design of a computer language was not part of the problem statement. However, the initial impression of paradox is resolved by the fact that an API was a core piece of the problem; the API, as the highest level of abstraction, in turn determines all other components of the problem statement. Most of the basic issues facing big data computation now have well-established solutions, so these are not the central problem being solved; the work involved in enabling big data computation lay in putting those solutions together in a streamlined manner. The direction of this work was defined by the API, which was the most novel of all the work done. As such, while there were technical challenges relating to the distributed system, discussed as encountered in the relevant sections, the overarching, definitive challenges were traced out by the design of the API.