Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially when Spark's in-memory computing capabilities can be leveraged.
In this tutorial we will write a Spark script in Python. When we're done, we will run the script on Mortar to test it out.
Spark programs operate on Resilient Distributed Datasets (RDDs), which the official Spark documentation defines as "a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel." In this tutorial we will create a new RDD from an input data file and then perform several transformations (operations that produce a new RDD from an existing one) and actions (operations that deliver some form of output) to train and validate a machine learning model.
MLlib is Spark's machine learning library, which we will use in this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.
That's the (extremely) high-level overview. We'll get into more specifics in the tutorial itself. Ready to do some machine learning with Spark? Let's get started.