Introduction
- Real-time, flexible, and extensible platform for modelling with larger-than-memory datasets in R
Package: largeScaleR
Type: Package
Title: Provides a Distributed Framework for Statistical Modelling
Version: 0.4
Motivation
- Larger-than-Memory datasets
- e.g. taxicab: monthly-updated dataset of taxi trips from the NYC TLC (Taxi and Limousine Commission)
- ~2.5 GB/month since 2009 (~1/3 TB in total)
- Loading it naively into memory: crash or thrash. Why?
Specifications
- A platform enabling creation of novel models for larger-than-memory
datasets
- Interactive
- Simple to use and set up; minimal difference from using the existing system
- Fast
- Robust
- Extensible
Local Approaches
Using R
disk.frame
File-backed dataframes
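A minimal sketch of the disk.frame approach: data frames are held as chunked files on disk and processed one chunk at a time with dplyr verbs. The dataset and worker count here are illustrative.

  library(disk.frame)
  library(dplyr)
  setup_disk.frame(workers = 4)       # background workers for chunk processing

  # Convert an in-memory data frame into a chunked, file-backed disk.frame
  flights.df <- as.disk.frame(nycflights13::flights)

  # Verbs run per chunk; collect() brings the (small) result into memory
  flights.df %>%
    group_by(carrier) %>%
    summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    collect()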

Distributed Approaches
Outside of R
- MPI: C, C++, Fortran; de facto standard for HPC; explicit point-to-point communication
- Hadoop (HDFS, MapReduce)
- Spark: Scala; a response to Hadoop; RDD and Dataset APIs
- Dask: Python; task scheduling, distributed data structures
MapReduce with Hadoop
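The paradigm itself is independent of Hadoop and can be sketched with base R's Map and Reduce: a map step emits per-split intermediate results, and a reduce step folds them together. A toy word count, conceptual only, not Hadoop's API:

  docs <- list("the cat sat", "the dog sat", "the cat ran")

  # Map phase: each document emits a table of word counts
  mapped <- Map(function(d) table(strsplit(d, " ")[[1]]), docs)

  # Reduce phase: fold the per-document counts into one tally
  merge_counts <- function(a, b) {
    words <- union(names(a), names(b))
    sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
  }
  Reduce(merge_counts, mapped)
  # cat 2, sat 2, the 3, dog 1, ran 1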

Distributed Approaches
Using R
- SNOW (Simple Network of Workstations)
- pbdR, pbdDMAT: R frontend to MPI
- sparklyr: R frontend to Spark
SNOW
Split list and map over multiple processes
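A minimal sketch of that pattern using the snow-derived API in the parallel package (cluster size illustrative):

  library(parallel)
  cl <- makeCluster(4)                     # 4 local worker processes
  chunks <- split(1:1e6, cut(1:1e6, 4))    # split the data into 4 pieces
  partial <- parLapply(cl, chunks, sum)    # map: each worker sums its chunk
  Reduce(`+`, partial)                     # reduce on the master: 500000500000
  stopCluster(cl)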

Preliminary Results
- Initial package development
- Cluster initialisation
- Object distribution
- Distributed Object interaction

Preliminary Results in Detail
- Queue communication
- Worker evaluation
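In outline: the master serialises a call onto a queue; a worker pops it, evaluates it, and the result becomes available to the master. A toy single-process sketch of that message flow (largeScaleR's actual transport is not shown here):

  # Toy in-process queue standing in for the real inter-process transport
  queue <- new.env()
  queue$msgs <- list()
  push <- function(q, m) q$msgs <- c(q$msgs, list(m))
  pop  <- function(q) { m <- q$msgs[[1]]; q$msgs[[1]] <- NULL; m }

  # Master side: enqueue a call described as data
  push(queue, list(what = "sum", args = list(1:100)))

  # Worker side: dequeue and evaluate
  msg <- pop(queue)
  do.call(msg$what, msg$args)   # 5050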

Preliminary Results in Detail
distribute()
do.dcall(what, args)
emerge()
- Generics
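A hypothetical end-to-end sketch built from the generics named above; only the do.dcall(what, args) signature comes from the slide, and the exact argument forms of distribute() and emerge() are assumptions:

  library(largeScaleR)

  x <- distribute(1:1e6)          # scatter chunks across the workers (assumed form)
  d <- do.dcall(what = "log",     # evaluate log() over each chunk remotely
                args = list(x))
  head(emerge(d))                 # gather the distributed result back to the master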
Main Demonstration
- First 3 months of the 2011 taxicab dataset (32M rows)
- Determine total tips
- Create plots of pickup & dropoff locations
- 32 processes over 8 nodes
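In outline, the total-tips computation would follow the same distribute/evaluate/emerge pattern. A speculative sketch, assuming do.dcall() accepts a function for what (as do.call() does), with the ingestion step and tip_amount column name as illustrative stand-ins:

  trips <- distribute(read.csv("taxi_2011_q1.csv"))   # hypothetical ingestion
  # sum tips within each chunk on the workers, then total on the master
  partial <- do.dcall(what = function(d) sum(d$tip_amount), args = list(trips))
  sum(emerge(partial))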
Challenges
- Evaluation & Alignment: Recycling over distributed arrays
- Asynchrony: Race conditions, dependencies
- Debugging: Distributed errors
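Recycling is why alignment is hard: R silently repeats the shorter operand, and chunks held on different workers must reproduce this element for element. A plain in-memory illustration:

  1:6 + 1:2
  # [1] 2 4 4 6 6 8   -- the shorter vector is recycled
  # If 1:6 is split across workers as 1:3 and 4:6, the first chunk must
  # pair with (1, 2, 1) and the second with (2, 1, 2) to match this result.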
Further Work
- Fault Tolerance: P(failure) = 1 − (1 − p)^n → 1 as n → ∞ (see the numeric check after this list)
- Memory usage optimisation
- Interfacing with other systems
- Benchmarking
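A quick numeric check of the fault-tolerance bound above, with per-process failure probability p assumed at 1% over n = 100 processes:

  p <- 0.01; n <- 100
  1 - (1 - p)^n   # ~0.634: some failure becomes near-certain as n grows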
GitHub