Jason Cairns


While the means by which a distributed system communicates is typically a matter of implementation, the manner of communication has ramifications not only for implementation but for what may be expected in terms of capability and efficiency. As such, the very architecture of the system depends upon communication, setting it up as a definitive challenge.

Distributed systems exhibit myriad forms of communication architecture, with each architecture capable of being represented as a graph. At one extreme, all nodes in the system communicate directly with all others, forming a complete graph. This is the approach taken by peer-to-peer systems, with attendant advantages of redundancy and resilience. It allows for fine-grained transfer of data and directives, potentially attaining high levels of efficiency in computations that stand to gain from such a form of communication. In addition, peer-to-peer communication consumes significantly less bandwidth than that required for a complete broadcast among all nodes.

Peer-to-peer communication finds its most notable implementation in MPI, primarily through the use of MPI_Send and MPI_Recv functions.

The capacity for peer-to-peer communication enables many applications; Wes Kendall demonstrates random walks and parallel particle tracing as some applications that can take advantage of point-to-point communication [1]. A generalisation of such applications would be any situation where a large amount of static data exists in distributed memory, and nodes perform iterative computation on the main data in conjunction with small pieces of data that are then transferred and received based on the output of the computation.
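The pattern described above can be illustrated with a minimal, single-process sketch in Python; it is purely illustrative, and all names (`owner`, `random_walk`, the block decomposition) are hypothetical rather than drawn from Kendall's implementation, which uses MPI_Send and MPI_Recv. The static domain is partitioned across workers, and walkers are the small pieces of data transferred between them after each step of computation:

```python
import random

# Illustrative sketch of the point-to-point random-walk pattern: the
# domain [0, domain_size) is split evenly across workers, and each
# walker is forwarded ("sent") to whichever worker owns its position.

def owner(position, domain_size, workers):
    """Worker owning a given position under a block decomposition."""
    block = domain_size // workers
    return min(position // block, workers - 1)

def random_walk(domain_size=20, workers=4, n_walkers=8, steps=16, seed=1):
    random.seed(seed)
    # inbox[w] plays the role of worker w's receive queue.
    inbox = [[] for _ in range(workers)]
    for _ in range(n_walkers):
        start = random.randrange(domain_size)
        inbox[owner(start, domain_size, workers)].append(start)
    for _ in range(steps):
        outbox = [[] for _ in range(workers)]
        for w in range(workers):
            for pos in inbox[w]:
                pos = (pos + random.choice([-1, 1])) % domain_size
                # "Send" the walker to the worker owning its new position.
                outbox[owner(pos, domain_size, workers)].append(pos)
        inbox = outbox
    return inbox

final = random_walk()
# Every walker ends up queued at the worker that owns its position.
```

The key property is that each message is small (a walker's position) relative to the static domain data, which never moves.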

Applications in the domain of statistical modelling (beyond a random walk) include Bayesian networks, undirected Markov blankets, Hidden Markov Models, and even neural networks, wherein each node contains several consecutive layers belonging to a larger network spanned by all of the nodes.

Peer-to-peer communication has the potential to add significant performance capabilities to a distributed system. This comes with the potential for deadlock and race conditions, which must be carefully managed by both the implementation and the user. A large-scale platform for statistical modelling in R would benefit from such a capability, but it introduces several difficulties, including the determination of an appropriate interface, a clean implementation avoiding excess communication, as well as new issues with scalability, as each additional node adds connections to every existing node, with the total number of connections growing at order O(n²).
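The deadlock hazard can be made concrete with a small scheduler sketch in Python (hypothetical and purely illustrative; real message-passing systems face the same hazard with synchronous sends). Each node's program is a list of send/receive operations, and a synchronous send completes only when its peer is ready to receive:

```python
# Sketch of why unbuffered, blocking sends can deadlock. Each node's
# program is a list of ("send", peer) / ("recv", peer) operations; a
# send completes only when the matching recv is the peer's next
# operation. If no node can progress, the system is deadlocked.

def run(programs):
    pc = [0] * len(programs)  # each node's program counter
    while True:
        progressed = False
        for i, prog in enumerate(programs):
            if pc[i] >= len(prog):
                continue  # node i has finished
            op, peer = prog[pc[i]]
            if op == "send":
                # Synchronous send: completes only at a matching recv.
                if pc[peer] < len(programs[peer]) and \
                        programs[peer][pc[peer]] == ("recv", i):
                    pc[i] += 1
                    pc[peer] += 1
                    progressed = True
        if all(pc[i] >= len(p) for i, p in enumerate(programs)):
            return "completed"
        if not progressed:
            return "deadlock"

# Both nodes send first: neither can ever progress.
naive = run([[("send", 1), ("recv", 1)],
             [("send", 0), ("recv", 0)]])
# Ordering the exchange (node 1 receives before sending) succeeds.
ordered = run([[("send", 1), ("recv", 1)],
               [("recv", 0), ("send", 0)]])
```

The fix shown, ordering sends and receives by rank, is one of the management burdens the text notes must fall on the implementation or the user.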

The other extreme of communication architecture is one in which a client machine communicates with worker nodes which in turn don’t communicate with each other. Such an architecture is also known as one-to-many. This format overcomes many of the coordination and scalability issues inherent in a peer-to-peer architecture, with the weakness of introducing a single point of failure and bottleneck that creates its own scale issues. Between these extrema, various forms of intermediary nodes and hierarchical architectures exist.

Another consideration, from the perspective of the API, is user knowledge of data locations. In a peer-to-peer system, there needs to be some tracking of chunk locations. A one-to-many system may be sufficiently simplified that all locations are assumed to contain a chunk, with no necessity of tracking individual chunks. Different levels of abstraction separate knowledge of data location. At the lowest level, every chunk may be uniquely addressed by way of its location. A higher level of abstraction may engage in information hiding of location, with the API providing addressing of chunks, and determining their location internally. Higher still is to ignore chunks and locations entirely, with the API hiding both.
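The three levels of abstraction can be sketched as follows; the sketch is hypothetical (all names are invented for illustration) and uses Python for brevity:

```python
# Sketch of three levels of data-location abstraction.

# Level 1 (lowest): the user addresses chunks by explicit location.
chunks_at = {"node1": ["c1", "c3"], "node2": ["c2"]}

def read_chunk(node, chunk):
    assert chunk in chunks_at[node]
    return f"data({chunk}@{node})"

# Level 2: the user addresses chunks only; the system resolves their
# location internally via a location table (information hiding).
location = {c: node for node, cs in chunks_at.items() for c in cs}

def read_chunk_by_id(chunk):
    return read_chunk(location[chunk], chunk)

# Level 3 (highest): the user addresses whole distributed objects;
# both chunks and locations are hidden behind the API.
objects = {"x": ["c1", "c2", "c3"]}

def read_object(name):
    return [read_chunk_by_id(c) for c in objects[name]]
```

Each level is implemented in terms of the one below it, which is why an API committed only to the highest level cannot recover chunk- or location-specific operations.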

SNOW is an example of a one-to-many architecture, and it makes use of the highest level of abstraction with respect to data location. The issues relating to SNOW covered in the previous chapter are largely independent of its architectural basis of communication. However, the intention to simplify operation by assuming that all nodes hold all relevant chunks, with no notion of location, leads to an inability to perform certain high-level tasks that depend on the lower-level communication architecture. Notably, any operation of accumulation over a distributed data source is out of reach for standard use of the API. This means that a true reduce operation, in the functional sense, cannot be performed over the data.

An early prototype in the project overcame this limitation by keeping track of individual chunks. This was aided by an architectural amendment through indirection: an additional node was inserted between the client and the worker nodes, with all communication passing through it. The principal purpose of this node was to perform routing between chunks. Rather than the client addressing nodes directly, chunks were addressed, and the router node ensured that each job request was sent to the node holding the corresponding chunk. With the ability to address chunks directly as a means of communication, higher-level functionality becomes available. However, the specific architectural decision to make use of a distinct router node has the consequence of creating a distinct point of failure. It should be noted that the single point of failure is not an issue introduced by this further development; it is an even greater issue in the SNOW model, where the lack of communication among worker nodes means a total lack of data redundancy, and thereby of any form of resilience against node failure. With a distinct point of failure, the system as a whole faces an increasing probability of failure at scale, and so cannot be considered a genuinely capable large-scale system.
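The routing indirection described above can be sketched minimally in Python; the class and method names are hypothetical, not those of the prototype:

```python
# Sketch of the router-node amendment: the client addresses chunks,
# and the router forwards each job request to whichever worker holds
# the chunk, hiding node identity from the client.

class Worker:
    def __init__(self, chunks):
        self.chunks = dict(chunks)  # chunk id -> data

    def run(self, chunk, job):
        return job(self.chunks[chunk])

class Router:
    """Routes chunk-addressed requests; itself a single point of failure."""
    def __init__(self, workers):
        # chunk id -> worker currently holding it
        self.table = {c: w for w in workers for c in w.chunks}

    def request(self, chunk, job):
        return self.table[chunk].run(chunk, job)

w1 = Worker({"c1": [1, 2, 3]})
w2 = Worker({"c2": [4, 5, 6]})
router = Router([w1, w2])
# The client accumulates over chunks without knowing which worker
# holds each one, enabling a reduce over the distributed data.
total = router.request("c1", sum) + router.request("c2", sum)
```

Because the routing table is held in one process, every request traverses it, which is precisely the distinct point of failure and bottleneck noted above.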

This is a major communication challenge to be overcome.

Another component of communication involves heterogeneity in the system. The format of communication, if language-specific, enforces a homogeneity that may prove limiting. None of the packages considered enable any language variation within a single system, despite such a capacity being entirely conceivable, and in certain situations desirable, in order to make use of tools better suited to specific jobs. To qualify this, some of the underlying systems, such as Hadoop, do allow for command-line map jobs, which may use differing languages from job to job. How to enable a language-agnostic capability is a further definitive challenge.
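The mechanism behind such command-line map jobs is a language-agnostic protocol rather than a language-specific serialisation format: in Hadoop Streaming, a mapper is any executable that reads records on standard input and writes tab-separated key/value pairs to standard output. A word-count mapper in that style might look as follows (a sketch; the function name and sample data are invented):

```python
# Sketch of a streaming-style mapper: because the contract is plain
# text over stdin/stdout, each job in a pipeline may be written in a
# different language.

def mapper(lines):
    """Emit '<word>\t1' for each word in each input record."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real streaming job, `lines` would be sys.stdin and each emitted
# pair would be printed to stdout.
sample = ["the quick fox", "the fox"]
emitted = list(mapper(sample))
```

The same contract could equally be satisfied by an R script, a shell pipeline, or a compiled binary, which is what makes per-job language variation possible.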

[1] W. Kendall, “MPI tutorial: Point-to-point communication application - random walk.”