2022-10-31
More on:
- Call-by-value
- No scope
- Chunk-wise argument alignment
- Error interrupt only when emerged to master
The evaluation of an expression in a programming language proceeds under a set of rules by which the original expression is rewritten. Of greatest interest when discussing evaluation strategies is the treatment of procedures, in particular the parameter-passing strategy and binding strategy used in evaluation. Error handling is a related topic, specifically in the concrete case of halting any further evaluation until an error is handled. Adjacent to evaluation is the concept of scope, wherein a set of rules determines the referent entity of a particular name.
The mechanism of parameter passing can take several different forms, but is categorised at the highest level by whether the parameters passed to a function must be fully evaluated before the function is applied [1]. A strategy requiring full evaluation before function application is termed “strict” evaluation, while the converse is termed “non-strict” evaluation. Alternative terminology is “eager” and “lazy” evaluation, respectively[2]. Binding strategy is a dependent concept, in which the form of the parameters passed to the function is considered[3]. Binding strategies falling under strict evaluation involve passing to the function a fully evaluated value, an address of the value, or some other public interface. Non-strict binding strategies typically involve some variation on passing the name of a value that only resolves upon attempted access.
The challenge of evaluation in a distributed system is the question of how to handle arguments, errors, and the order of computation.
In many cases, the evaluation strategy makes no difference to the outcome of an evaluation, especially among simple pure functions with constant inputs. Evaluation strategies only became a major research topic once compilers had to handle subroutine calls taking arbitrary expressions, rather than pure constants, as arguments. The choice of strategy becomes more relevant in computation with side effects, where the order of operations may run contrary to expectations. Furthermore, non-strict binding strategies make the creation of ad-hoc control structures trivial, and also permit infinite data structures and the avoidance of certain error conditions[4].
R possesses an evaluation system that is unusual among programming languages. Of the discussed parameter-passing strategies, it is non-strict, in that arguments are not evaluated before the function they are passed to is applied[5]. The binding strategy falls into the category of “call-by-need”, where parameters are only evaluated upon their reference during the evaluation of the body of the function[6]. Call-by-need is a variation of the more general “call-by-name” strategy. Call-by-name involves passing the name, in some form, to the function, where, in the terminology of lambda calculus, it is beta-reduced with capture-avoiding substitution[7]. Call-by-need is effectively an optimisation of call-by-name; while call-by-name requires evaluation of the parameter every time it is referenced, call-by-need engages in memoisation, caching the value of the first evaluation of each parameter for successive reference.
Interestingly, call-by-need is most prevalent among functional programming languages with roots in the lambda calculus, including Haskell, though it was introduced more recently than other, more common evaluation strategies[8]. Call-by-need grants unique abilities; most notably, the capacity for non-evaluation of parameters may bring performance improvements. In the extreme case, a function evaluated in a call-by-need manner may terminate in circumstances where strict evaluation will prevent termination. An example is given in lst. 1.
Listing 1: Evaluation of this program will terminate under call-by-need but not terminate under strict evaluation
> x <- function(...) NULL
> x(Sys.sleep(Inf))
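The memoising aspect of call-by-need can also be observed directly: a side-effecting argument expression runs only on first access, with later accesses reusing the cached value. A minimal sketch, with an illustrative counter:

```r
# A counter tracks how many times the argument expression runs; under
# call-by-need it runs once, despite `x` being referenced twice in the body.
count <- 0
noisy <- function() { count <<- count + 1; 21 }
f <- function(x) x + x
result <- f(noisy())  # forces the promise once, memoises the value 21
result                # 42
count                 # 1, not 2: the second reference reused the cache
```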
Furthermore, the expression of the passed parameter is made available
under call-by-need, and is commonly made use of in R, as in the
deparse(substitute())
idiom employed for the implicit labelling
of plot axes, among other uses. Other abilities granted by way of
call-by-name include the potential to create control structures, short
circuit evaluation as typified by boolean operations, and infinite data
structures. Drawbacks to call-by-need include potential inefficiencies
in the implementation of the strategy, as well as associated
complexities and difficulties in debugging[9].
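The availability of the argument expression can be demonstrated with the deparse(substitute()) idiom mentioned above:

```r
# substitute() retrieves the unevaluated argument expression from the
# promise; deparse() renders it as a string, as plot() does for axis labels.
label <- function(x) deparse(substitute(x))
height <- c(1.2, 1.5)
label(height)      # "height"
label(height * 2)  # "height * 2"
```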
The manner of implementation for call-by-need typically requires a
data structure known as a “thunk”[10]. A thunk is a variation on a
closure, keeping reference to the passed expression and an environment
in which to evaluate the expression upon the thunk being accessed in the
body of the procedure. Upon evaluation, the thunk is effectively
replaced by its computed value, with further references accessing that
value directly. R makes use of this form of processing, with thunks
referred to as “promises”; the term “promise” here bears no relation to
its standard usage for the concurrent data structures associated with
futures[11]. A promise in R encapsulates
pointers to the value, expression, and environment. Promises are
introduced in R through a list of promises replacing the argument list
upon a call to a closure. A new environment is constructed for the
evaluation of the closure, and the promises are matched to the names of
the formal parameters. Evaluation of promises in a function is performed
in the evaluation frame of the calling function, rather than the
evaluation frame of the function, as default arguments are. Upon
evaluation of a promise, the value pointer is set to the resulting
value, the environment is set to NULL, and the header of the SEXP
underlying the promise has the “PRSEEN” flag set to indicate that it has
been accessed, in order to avoid recursive loops. In R nomenclature, the
evaluation of a promise is known as “forcing” the promise, and from this
point on in the evaluation of the closure, the promise serves the value
only, with no further evaluation. As hinted at above, the expression is
still capable of being accessed through the substitute()
function. Evaluation of a promise only takes place when the value is
needed, and promises may themselves refer to other promises, such as
when unevaluated arguments are passed through calling functions, and
these promises are recursively forced in turn.
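The recursive forcing of promise chains can be seen when an argument passes unevaluated through an intermediate function:

```r
# `outer` passes its unevaluated argument straight to `inner`; the single
# access of `y` forces inner's promise, which forces outer's promise in turn.
inner <- function(y) y * 2
outer <- function(x) inner(x)
outer(10 + 1)  # 22: both promises forced by one access
```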
Promises are used beyond function calls in R, including use for loading code from packages; loading packages attaches a namespace populated with promises which only evaluate upon access.
An exception to the evaluation strategy described here for R is that the “builtin” functions in R are evaluated strictly, with parameters being evaluated before being passed to the underlying C code.
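The distinction is visible through typeof(): “builtin” functions are strict, while closures receive their arguments as promises.

```r
typeof(`+`)   # "builtin": arguments are evaluated before the C code runs
typeof(mean)  # "closure": arguments arrive as unforced promises
```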
At this point it is worth discussing other mechanisms for evaluation, for context.
Strict evaluation is typified by applicative order evaluation, which has rewrite rules corresponding to complete evaluation of all function arguments prior to evaluation of the function[3]. The principal binding strategies include call-by-value, call-by-reference, and object-oriented variations such as call-by-object. Call-by-value involves full evaluation of parameters, then calling the function with copies of the values of the parameters[12]. It is commonly used, with an example being C’s handling of non-pointer data. Call-by-reference requires passing parameter references to a procedure, allowing for mutability[13]. Examples include C’s handling of pointer data, as well as equivalent evaluations in Java and Rust. Notably, functional languages possessing immutable data see no difference in outcome between the two strategies. Call-by-object involves sharing an object passed as a parameter [14]. This evaluation strategy relates to call-by-reference, though differs in that passing a variable means passing access to the actual object, not access to the caller’s variable. In this way, a passed parameter is able to be mutated from within a function, though reassignment to the formal parameter doesn’t change the binding in the caller. Many object-oriented languages engage in a call-by-object strategy, including Python and Ruby.
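At the level of observable behaviour, R itself exhibits the value semantics associated with call-by-value, through copy-on-modify:

```r
# The mutation inside `f` operates on a (logical) copy, and so is
# invisible to the caller -- value semantics, as in call-by-value.
f <- function(v) { v[1] <- 99; v }
x <- c(1, 2, 3)
f(x)  # 99 2 3
x     # 1 2 3: the caller's binding is unchanged
```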
Call-by-name and call-by-need are joined by other forms and mechanisms of lazy evaluation. One such strategy, known as optimistic evaluation, is a variation on call-by-need, wherein a function’s parameters are evaluated under some time limit upon the application of that function, with any evaluations exceeding the time limit terminated and switched to a standard call-by-need evaluation strategy [15]. This approach is intended to capture the efficiency benefits of both call-by-need and strict evaluation. Another optimisation in lazy evaluation is given by graph reduction. Graph reduction involves treating expressions not only as abstract syntax trees, but as directed acyclic graphs, where multiple invocations of the same expression are depicted once only in the graph and linked together to avoid re-evaluation of the same expression[16]. An equivalence exists between graph reduction and the memoisation of call-by-need.
Given that distributing computation necessitates a rearrangement of evaluative computation, a question arises: How does evaluation proceed in a distributed system for R?
The SNOW package may be considered first for its pedigree. One of the
fundamental procedures in SNOW is clusterCall()
[17]. The
evaluation of this procedure consists in first evaluating arguments on
the master, with their values sent to slave nodes which execute the
function call. This corresponds to a strict evaluation strategy, being
an effective call-by-value. There is some nuance, though: within the
function called on the slave node, a call-by-need strategy still takes
place, as the call is applied by the R evaluator; in practice, however,
with all arguments evaluated prior to function application, the “strict”
classification dominates. Other functions in
SNOW follow a similar form.
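A hedged sketch of this strict behaviour, assuming a SNOW installation (the makeCluster()/clusterCall() names are from the SNOW API, mirrored by the base parallel package):

```r
library(snow)
cl <- makeCluster(2, type = "SOCK")
x <- runif(1e6)
# sum(x) is fully evaluated here on the master; only the resulting
# scalar is serialised to each slave node -- effective call-by-value.
clusterCall(cl, function(total) total / 2, sum(x))
stopCluster(cl)
```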
Prior prototypes in the largescaler project developed herein follow a
similar form. Evaluation of distributed procedures requires evaluation
of arguments on arbitrary nodes in the network, with the values sent to
the node executing the function call. Unlike SNOW, evaluation of
arguments on arbitrary nodes in the network enables parallelism in
argument evaluation, which is potentially more efficient. This
corresponds to a rough call-by-future strategy. However, like SNOW, this
effectively remains a strict evaluation strategy, due to requiring full
parameter evaluation prior to function application, though still
possesses some degree of subtlety. This subtlety leads to a somewhat
orthogonal side-point; the form of generation of the function call
differs from that in regular R. Promises are still used in the function
call, so the originating expression is still in theory attainable.
However, prior largescaleR prototypes have been generating function
calls through the do.call()
function,
which constructs a call from a function and a list, then evaluates the
call. This results in a mismatch between the original call and the
actual call. An example of such a mismatch is given in lst. 2.
Listing 2: A mismatch in an actual call and one generated with do.call
> x = function(...) match.call()
> y = 1
> x(y)
x(y)
> do.call(x, list(y))
(function(...) match.call())(1)
This is seen in the issue surrounding call capture in model fitting, expounded upon in a separate document.
Alternative means of performing evaluation are conceivable. For
instance, a manner of evaluation similar to standard R could be enabled
through making use of distributed promises. The evaluation or collection
of the arguments to a function on a node would have to take place only
upon those arguments being accessed in the evaluation of the function
body. This implies a necessity to take advantage of promise objects.
Such a capacity is already inherent in R, with delayedAssign()
from
the base package granting direct access to promise creation. Upon access
of the promise objects, some operation should be enacted to emerge the
underlying expression locally, including potential initiation of remote
procedures. There are already limitations, however; one example is the
execution of a remote distributed call with local arguments – if the
call is to take place asynchronously, there is no immediate mechanism
for the node executing the call to attain the local arguments as and
when (and if) they are needed. This would require evaluation of the
arguments before sending off the call. This may be able to be overcome
if such local arguments were placed as promises in a local data store
running asynchronously that evaluated the arguments prior to sending
them off to requesters. All of this remains highly complex – one of the
original points of opposition to thunks [10] among compiler constructors was the
difficulty and complexity introduced through their usage, with the
dearth of debuggers originally available as testament to this
notion.
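The promise-creation facility mentioned above can be demonstrated locally with delayedAssign():

```r
# delayedAssign() binds `v` to an unevaluated expression; the flag shows
# that evaluation is deferred until first access, exactly as for a
# function argument.
forced <- FALSE
delayedAssign("v", { forced <<- TRUE; 42 })
forced  # FALSE: the promise has not yet been forced
v       # 42: first access forces the promise
forced  # TRUE
```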
Another point to consider is that lazy evaluation is capable of being emulated in strict languages supporting closures or generators, through caching of values upon the constructs being called. However, this is less relevant to the case under consideration here, as it would require explicit calling of arguments in the definitions of functions to be evaluated in a distributed manner, which is undesirable when aiming for an API with greater conformance to standard R usage.
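A sketch of such an emulation, written here in R but standing in for what a strict language with closures must do explicitly:

```r
# A memoised thunk built from a closure: the wrapped expression runs on
# the first call only, with the value cached for subsequent calls.
make_thunk <- function(f) {
  value <- NULL
  forced <- FALSE
  function() {
    if (!forced) { value <<- f(); forced <<- TRUE }
    value
  }
}
runs <- 0
t <- make_thunk(function() { runs <<- runs + 1; 7 })
t()   # 7: evaluates the wrapped expression
t()   # 7: returns the cached value
runs  # 1: evaluated once only
```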
The side-point made about call construction, and illustrated in
lst. 2, may be met through capturing
the original call and sending it along with references to the values to
the executing node. Upon reception by the executing node, it may
construct an outer evaluation frame with the passed variable names given
as arguments in the captured expression being bound to their associated
values (or distributed thunks)[10], and the expression evaluated
within this outer evaluation frame. A drawback of the approach is
evaluation of parameters prior to function application, but this may be
sidestepped through the C API in a similar manner to the definition of
...length()
[3].
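A single-machine sketch of this approach, with illustrative names: the captured call is evaluated in a constructed environment in which the argument names are bound to values received from elsewhere.

```r
captured <- quote(mean(x) + y)          # call captured on the originating node
frame <- new.env(parent = globalenv())  # outer evaluation frame on the executor
assign("x", c(2, 4, 6), envir = frame)  # stands in for a transferred chunk value
assign("y", 1, envir = frame)
eval(captured, envir = frame)           # 5: the original call, evaluated intact
```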
Another aspect to consider with distributed arguments is that of alignment. A node may only engage in computation with data that resides on it at the point of computation. In order to attain the data, the data must exist somewhere in the cluster and possibly be transferred to the computing node. The arrangement of the underlying chunks corresponding to the distributed arguments to some function must be ordered in a very specific manner, so that the relevant chunks which interact with each other eventually find themselves placed on the same node. For instance, consider a computation involving two lists consisting of multiple elements, where the computation only requires the interaction between elements at the same index at any one time. In this case, each set of elements paired by index is required to be on the same node, but there are no further requirements regarding location. Such a computation is trivial to parallelise and distribute, with only one such requirement. The notion of ‘recycling’ is also pertinent to alignment. Many functions in R will repeat the elements in vector arguments to match the length of the longest. Emulating such functionality is decidedly non-trivial in a distributed system, however. The naive approach of repeating chunks breaks down entirely when chunks don’t hold a perfectly linear correspondence to the number of elements in the represented data structure. As such, the data structure has to be queried for its lengths, and chunks selected from and rearranged in order to perform this. Selection of elements also suffers from precisely this issue; knowing where the nth element lies in a distributed object requires a querying of chunks from the beginning onwards until it is found.
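The index-paired case described above can be sketched with chunks represented as list elements; the representation is illustrative only, not the largescaler chunk format.

```r
# Two distributed vectors split into identically-shaped chunks: an
# elementwise operation only requires co-locating chunks of equal index,
# here simulated by pairing list elements with Map().
chunks_a <- list(1:3, 4:6)
chunks_b <- list(rep(10L, 3), rep(20L, 3))
aligned_sum <- Map(`+`, chunks_a, chunks_b)  # pair chunks by index
unlist(aligned_sum)  # 11 12 13 24 25 26
```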
Errors also pose a significant challenge. Upon encountering one in a standard session, evaluation is halted and directed by the specific handling of the error. In a distributed system, errors take different forms, and can be impossible to detect until long after they take place. Errors may be captured, but this is insufficient if the information is not propagated back to the client node from which the faulty request originated. Were every single computation to be monitored for errors and the result propagated to the client, this would represent a piece of infrastructure with the potential to cause major inefficiencies if it involved any significant communication or blocking. As such, an appropriate mechanism for handling errors is a non-trivial challenge to be solved. The simplest solution may be to capture errors on the worker, and raise them on the master only when the relevant distributed object is emerged to master.
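That simple solution can be sketched as follows; worker_eval() and emerge() are illustrative stand-ins for real communication infrastructure.

```r
# The worker captures any error as a condition object rather than
# halting; the master only raises it when the result is emerged.
worker_eval <- function(expr) {
  tryCatch(eval(expr), error = function(e) e)  # capture, don't halt
}
emerge <- function(result) {
  if (inherits(result, "error")) stop(result)  # raise only upon emerge
  result
}
emerge(worker_eval(quote(1 + 1)))        # 2
bad <- worker_eval(quote(stop("boom")))  # no error raised yet
inherits(bad, "error")                   # TRUE: deferred until emerge(bad)
```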
Scope becomes relevant when variables are referred to which originate outside of a particular block of program code. Scope typically falls into one of two categories: in lexical scope, the resolution of an identifier is based on the location at which a particular function is defined. This is contrasted with dynamic scope, where resolution depends on the specific state of the program at the encounter of a particular name. R possesses lexical scoping rules, one of the original differentiators from its syntactic basis, the S language. Most modern languages possess lexical scope. An important component of the implementation of lexical scope is some means of access to the environment of a function through which to search for the referent of a name as it is encountered. This is reasonably straightforward when a chain of environments is located in the same process as a function, but becomes extremely difficult to manage efficiently when this is not the case. The naive approach of sending a full environment chain with every single function requires potentially enormous transfers of data (as contained in the environments) with every function invocation, and is as such completely impractical. Performing static analysis on the function code, and only sending the referenced entities, is also impractical, given that referenced entities can’t truly be known without actually evaluating the function. It is conceivable that a means could be devised whereby a worker node evaluating a function, upon encountering a name defined outside of the function, could send a request to the master to perform the search in lexical fashion, and continue execution upon successful reception of the entity sent from the master; however, such an operation relies on a level of concurrency that does not exist in the R language. When only variables existing within the function frame are made use of, most problems of scope become irrelevant, so the problem can be temporarily swept under the rug.
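R's lexical resolution is straightforward to observe in a single process; the difficulty described above is precisely that the defining environment of make_adder() below may live on a different machine from the caller.

```r
# `n` is resolved in the environment where the inner function was
# defined (lexical scope), not in the environment of the call.
make_adder <- function(n) function(x) x + n
add5 <- make_adder(5)
n <- 1000  # a binding in the calling environment, ignored by add5
add5(2)    # 7: lexical lookup finds n = 5 in the defining environment
```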