In this tutorial we will implement a text classifier with Spark that predicts which newsgroup a message belongs to by analyzing its text. Along the way you will learn how to work with Spark and with Resilient Distributed Datasets (RDDs), the parallelized collections that Spark operates on. We will also do some natural language processing (NLP) with Python to transform the raw text into clean input data that the classifier can analyze efficiently.
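To give a feel for what "transforming raw text into clean input" can mean, here is a minimal sketch in plain Python. It is not the tutorial's actual preprocessing: the function name `clean_message`, the tiny `STOP_WORDS` set, and the sample messages are all assumptions made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"})


def clean_message(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Made-up sample messages standing in for newsgroup posts.
messages = [
    "The goalie made a great save in overtime!",
    "Is it safe to overclock the CPU?",
]

cleaned = [clean_message(message) for message in messages]
for tokens in cleaned:
    print(tokens)
```

Because `clean_message` is an ordinary function of one message, the same function could be handed to an RDD's `map` method in PySpark so that the cleaning runs in parallel across the collection.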
Note: As you may know, Spark can be unstable at large data volumes, so we are not offering access to Spark on Mortar to all customers at this time. Over the coming weeks and months we will also be adding more Spark-specific features and support. If you'd like to give Spark a spin with your Mortar account, please drop us a note.