Evaluation in a Distributed System for R

Jason Cairns


1 Introduction

The evaluation of an expression in a programming language operates under a set of rules in which a rewriting of the original expression takes place. Of greatest interest when discussing evaluation strategies are the treatment of procedures, and in particular, the parameter-passing strategy and binding strategy for evaluation.

The mechanism of parameter passing can take several different forms, but is categorised at the highest level by whether the parameters passed to a function must be fully evaluated before a function is applied or not [1]. A strategy requiring full evaluation before function application is termed “strict” evaluation, while the converse is termed “non-strict” evaluation. Alternative terminology is “eager” and “lazy” evaluation, respectively[2]. Binding strategy is a dependent concept, in which the form of the parameters passed to the function are considered[3]. Binding strategies falling under strict evaluation involve passing to the function a fully evaluated value, an address to the value, or some other public interface. Non-strict binding strategies typically involve some variation on passing the name of a value that only resolves upon attempted access.

The evaluation strategy doesn’t make any difference to the outcome of an evaluation in many cases, especially among simple pure functions with constant inputs. Consideration of evaluation strategies only became a major research topic upon determining the compilation of a subroutine call with arbitrary expressions rather than pure constants. It becomes more relevant upon engaging in computation with side-effects, where the order of operations may be the inverse of expectations. Furthermore, the creation of ad-hoc control structures is trivial with non-strict binding strategies, as well as infinite data structures and avoidance of certain error conditions[4].

2 Evaluation in R

R possesses an evaluation system which is unusual among most programming languages. Of the discussed parameter passing strategies, it is non-strict, in that arguments are not evaluated before the function they are passed to is applied[5]. The binding strategy falls into the category of “call-by-need”, where parameters are only evaluated upon their reference during the evaluation of the body of the function[6]. Call-by-need is a variation of the more general “call-by-name” strategy. Call-by-name involves passing the name, in some form, to the function, where, in the terminology of lambda calculus, it is beta-reduced with capture-avoiding substitution[7]. Call-by-need is effectively an optimisation of call-by-name; while call-by-name requires evaluation of the parameter every time it is referenced, call-by-need engages in memoisation, caching the value of the first evaluation of each parameter for successive reference.

Interestingly, call-by-need is more prevalent among functional programming languages, including Haskell, having roots in the lambda calculus, though it has been introduced more recently than other more common evaluation strategies[8]. Call-by-need gains unique abilities; most notably, the capacity for non-evaluation of parameters may bring performance improvements. In the extreme case, a function evaluated in a call-by-need manner may terminate in circumstances where strict evaluation will prevent termination. An example is given in lst. 1.

Listing 1: Evaluation of this program will terminate under call-by-need but not terminate under strict evaluation

> x <- function(...) NULL
> x(Sys.sleep(Inf))

Furthermore, the name of the passed parameter is made available through call-by-need and is commonly made use of in R, such as the deparse(substitute()) employed for the implicit labelling of plot axes, among other uses. Other abilities granted by way of call-by-name include the potential to create control structures, short circuit evaluation as typified by boolean operations, and infinite data structures. Drawbacks to call-by-need include potential inefficiencies in the implementation of the strategy, as well as associated complexities and difficulties in debugging[9].

The manner of implementation for call-by-need typically requires a data structure known as a “thunk”[10]. A thunk is a variation on a closure, keeping reference to the passed expression and an environment in which to evaluate the expression upon the thunk being accessed in the body of the procedure. Upon evaluation, the thunk is effectively replaced by its computed value, with further references accessing that value directly. R makes use of this form of processing, with thunks being referred to as “promises”, with the term “promise” not bearing any relation to the standard usage of “promise” to refer to the concurrent data structures associated with futures[11]. A promise in R encapsulates pointers to the value, expression, and environment. Promises are introduced in R through a list of promises replacing the argument list upon a call to a closure. A new environment is constructed for the evaluation of the closure, and the promises are matched to the names of the formal parameters. Evaluation of promises in a function is performed in the evaluation frame of the calling function, rather than the evaluation frame of the function, as default arguments are. Upon evaluation of a promise, the value pointer is set to the resulting value, the environment is set to NULL, and the header of the SEXP underlying the promise has the “PRSEEN” flag set to indicate that it has been accessed, in order to avoid recursive loops. In R nomenclature, the evaluation of a promise is known is “forcing” the promise, and from this point on in the evaluation of the closure, the promise serves the value only, with no further evaluation. As hinted at above, the expression is still capable of being accessed through the substitute() function. Evaluation of a promise only takes place when the value is needed, and promises may themselves refer to other promises, such as when unevaluated arguments are passed through calling functions, and these promises are recursively forced in turn.

Promises are used beyond function calls in R, including use for loading code from packages; loading packages attaches a namespace populated with promises which only evaluate upon access.

An exception to the evaluation strategy described here for R is that the “builtin” functions in R are evaluated strictly, with parameters being evaluated before being passed to the underlying C code.

3 Alternative Methods of Evaluation

At this point it is worth discussing other mechanisms for evaluation, for context.

Strict evaluation is typified by applicative order evaluation, which has rewrite rules corresponding to complete evaluation of all function arguments prior to evaluation of the function[3]. The principal binding strategies include call-by-value, call-by-reference, and object-oriented variations such as call-by-object. Call-by-value involves full evaluation of parameters, then calling the function with copies of the values of the parameters[12]. It is commonly used, with an example being C’s handling of non-pointer data. Call-by-reference requires passing parameter references to a procedure, allowing for mutability[13]. Examples include C’s handling of pointer data, as well as equivalent evaluations in Java and Rust. Notably, functional languages possessing immutable data have no difference in outcome between the two evaluation strategies. Call-by-object involves sharing an object passed as a parameter [14]. This evaluation strategy relates to call-by-reference, though differs in that passing a variable means passing access to the actual object, not access to the callers variable. In this way, a passed parameter is able to be mutated from within a function, though reassignment to the formal parameter doesn’t change the binding in the caller. Many object-oriented languages engage in a call-by-object strategy, including Python and Ruby.

Call-by-name and call-by-need are joined by other forms and mechanisms of lazy evaluation. One such strategy, known as optimistic evaluation, is a variation on call-by-need, wherein a functions parameters are evaluated under some time limit upon the application of that function, with any evaluations exceeding the time limit terminated and switched to a standard call-by-need evaluation strategy [15]. This approach is intended to capture the efficiency benefits of both call-by-need as well as strict evaluation. Another optimisation in lazy evaluation is given by graph reduction. Graph reduction involves treating expressions not only as abstract syntax trees, but as directed acyclic graphs, where multiple invocations of the same expression are depicted once only in the graph and linked together to avoid re-evaluation of the same expression[16]. An equivalency exists with graph reduction and the memoisation of call-by-need.

4 Evaluation in R Distributed Systems

Given that distributing computation necessitates a rearrangement of evaluative computation, a question arises: How does evaluation proceed in a distributed system for R?

The SNOW package may be considered first for its pedigree. One of the fundamental procedures in SNOW is clusterCall()[17]. The evaluation of this procedure consists in first evaluating arguments on the master, with their values sent to slave nodes which execute the function call. This corresponds to a strict evaluation strategy, being an effective call-by-value. There is some nuance though, in that within the function being called on the slave node, a call-by-need strategy is still taking place, due to being applied by the R evaluator, but in practice with all arguments having been evaluated prior to function application, the “strict” classification is dominant. Other functions in SNOW follow a similar form.

Prior prototypes in the largescaleR project developed herein follow a similar form. Evaluation of distributed procedures requires evalutaion of arguments on arbitrary nodes in the network, with the values sent to the node executing the function call. Unlike SNOW, evaluation of arguments on arbitrary nodes in the network enables parallelism in argument evaluation, which is potentially more efficient. This corresponds to a rough call-by-future strategy. However, like SNOW, this effectively remains a strict evaluation strategy, due to requiring full parameter evaluation prior to function application, though still possesses some degree of subtlety. This subtlety leads to a somewhat orthogonal side-point; the form of generation of the function call differs from that in regular R. Promises are still used in the function call, so the originating expression is still in theory attainable. However, prior largescaleR prototypes have been generating function calls through the do.call() function, which constructs a call from a function and a list, then evaluates the call. This results in a mismatch in the original call and the actual call. An example of such a mismatch is given in listing lst. 2

Listing 2: A mismatch in an actual call and one generated with do.call

> x = function(...) match.call()
> y = 1
> x(y)
> do.call(x, list(y))
(function(...) match.call())(1)

This is seen in the issue surrounding call capture in model fitting, expounded upon in a separate document.

Beyond the above point regarding call creation, the method of evaluation in largescaleR prototypes has led to a missing capacity for procedures taking advantage of lazy evaluation, including the termination mechanism discussed in Section sec. 2 and illustrated in Listing lst. 1.

5 Alternatives to Prior Evaluation Strategies

In order to bring future prototypes of the project closer in line with standard R evaluation, a radically different evaluation mechanism is required. The evaluation or collection of the arguments to a function on a node would have to take place only upon those arguments being accessed in the evaluation of the function body. This implies a necessity to take advantage of promise objects. Such a capacity is already inherent in R, with delayedAssign() from the base package granting direct access to promise creation. Upon access of the promise objects, some operation should be enacted to emerge the underlying expression locally, including potential initiation of remote procedures. There are already limitations, however; One example is the execution of a remote distributed call with local arguments – if the call is to take place asynchronously, there is no immediate mechanism for the node executing the call to attain the local arguments as and when (and if) they are needed. This would require evaluation of the arguments before sending off the call. This may be able to be overcome if such local arguments were placed as promises in a local data store running asynchronously that evaluated the arguments prior to sending them off to requesters. All of this remains highly complex – one of the original points of opposition to thunks [10]among compiler constructors was the difficulty and complexity introduced through their usage, with the dearth of debuggers originally available as testament to this notion.

Another point to consider is that lazy evaluation is capable of being emulated in strict languages supporting closures or generators, through caching of values upon the constructs being called. However, this is less relevant to the case under consideration here, as it would require explicit calling of arguments in the definitions of functions to be evaluated in a distributed manner, which is undesirable when aiming for an API with greater conformance to standard R usage.

An argument can be made for leaving distributed evaluation as it is, for the potential efficiency gains in parallel pre-computation over call-by-need, which is serial in the naive case.

The side-point made about call construction in Section sec. 4, and illustrated in Listing lst. 2 may be met through capturing the original call and sending it along with references to the values to the executing node. Upon reception by the executing node, it may construct an outer evaluation frame with the passed variable names given as arguments in the captured expression being bound to their associated values (or distributed thunks)[10], and the expression evaluated within this outer evaluation frame. Variable names of passed parameters may be accessed for binding through code similar to the example given in Listing lst. 3. A drawback of the approach is evaluation of parameters prior to function application, but this may be sidestepped through the C API in a similar manner to the definition of ...length()[3].

Listing 3: Access to variable names given as formal parameters to a function call using x and y as defined in Listing 1

> do.call2 <- function(what, ..., envir = parent.frame()) {
+     outer_frame <- new.env(parent = envir)
+     assign(deparse(substitute(what)), what, pos=outer_frame)
+     parameters <- as.list(match.call())[-1]
+     args <- parameters[!names(parameters) %in% c("what", "envir")]
+     mapply(assign, sapply(args, deparse), list(...),
+            MoreArgs = list(pos=outer_frame))
+     constructed_call <- as.call(c(substitute(what), args))
+     eval(constructed_call, envir=outer_frame)
+ }
> do.call2(x, y)
E. Crank and M. Felleisen, “Parameter-passing and the lambda calculus,” in Proceedings of the 18th ACM SIGPLAN-SIGACT symposium on principles of programming languages, 1991, pp. 233–244. doi: 10.1145/99583.99616.
P. Henderson and J. H. Morris, “A lazy evaluator,” in Proceedings of the 3rd ACM SIGACT-SIGPLAN symposium on principles on programming languages, 1976, pp. 95–103. doi: 10.1145/800168.811543.
H. Abelson and G. J. Sussman, “4.2.1 normal order and applicative order,” in Structure and interpretation of computer programs, Cambridge, Mass. New York: MIT Press McGraw-Hill, 1996.
C. Okasaki, Purely functional data structures. Cambridge, U.K. New York: Cambridge University Press, 1998.
R. C. Team, “Argument evaluation,” in R internals, r-project.org, 2020.
Z. M. Ariola, J. Maraist, M. Odersky, M. Felleisen, and P. Wadler, “A call-by-need lambda calculus,” in Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on principles of programming languages, 1995, pp. 233–246. doi: 10.1145/199448.199507.
M. J. Fischer, “Lambda calculus schemata,” ACM SIGPLAN Notices, vol. 7, no. 1, pp. 104–109, 1972.
S. L. Nita and M. Mihailescu, “Introduction,” in Practical concurrent haskell: With big data applications, Berkeley, CA: Apress, 2017, pp. 3–11. doi: 10.1007/978-1-4842-2781-7_1.
H. Nilsson, “Tracing piece by piece: Affordable debugging for lazy functional languages,” in Proceedings of the fourth ACM SIGPLAN international conference on functional programming, 1999, pp. 36–47. doi: 10.1145/317636.317782.
P. Z. Ingerman, “Thunks: A way of compiling procedure statements with some comments on procedure declarations,” Commun. ACM, vol. 4, no. 1, pp. 55–58, Jan. 1961, doi: 10.1145/366062.366084.
R. C. Team, “Promise objects,” in R language definition, r-project.org, 2020.
G. D. Plotkin, “Call-by-name, call-by-value and the λ-calculus,” Theoretical computer science, vol. 1, no. 2, pp. 125–159, 1975.
F. Turbak, Design concepts in programming languages. Cambridge, Mass: MIT Press, 2008.
B. Liskov, R. Atkinson, T. Bloom, E. Moss, and C. Schaffert, “CLU reference manual.” MASSACHUSETTS INST OF TECH CAMBRIDGE LAB FOR COMPUTER SCIENCE, 1979.
R. Ennals and S. Peyton Jones, “Optimistic evaluation: A fast evaluation strategy for non-strict programs,” 2003, ACM International Conference on Functional Programming (ICFP’03).Available: https://www.microsoft.com/en-us/research/publication/optimistic-evaluation-fast-evaluation-strategy-non-strict-programs/
P. Hudak, “Conception, evolution, and application of functional programming languages,” ACM Comput. Surv., vol. 21, no. 3, pp. 359–411, Sep. 1989, doi: 10.1145/72551.72554.
L. Tierney, A. J. Rossini, N. Li, and H. Sevcikova, SNOW: Simple network of workstations. CRAN, 2018.