Now that our data is loaded into Spark and the text has been cleaned up, we're going to use the hashing trick to turn the text of our newsgroup postings into numerical vectors that a machine learning model can work with. MLlib has a utility called HashingTF to do just that. HashingTF transforms a bag of words into a vector of term frequencies by applying a hash function to each term. Because the vector has a finite number of elements, it's possible that two different terms will map to the same index (a hash collision), meaning that the hashed, vectorized features may not exactly represent the actual content of the input text. So we'll set up a relatively large feature vector, with room for 50,000 different hashed values, to reduce the chance of these collisions.
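To see what the hashing trick does, here's a minimal plain-Python sketch (a simplified illustration, not MLlib's actual implementation): each term is hashed to an index modulo the vector size, and the count at that index is incremented.

```python
def hashed_term_frequencies(tokens, num_features):
    """Map a bag of words to a sparse {index: count} dict via the hashing trick."""
    vector = {}
    for token in tokens:
        # Hash each term to a fixed-size index space; distinct terms can collide
        index = hash(token) % num_features
        vector[index] = vector.get(index, 0) + 1
    return vector

# With a tiny vector, collisions are likely; with 50,000 slots, they're rare
small = hashed_term_frequencies(["spark", "rdd", "spark"], 8)
large = hashed_term_frequencies(["spark", "rdd", "spark"], 50000)
```

The total count is always preserved (three tokens in, counts summing to three out); what a collision costs you is the ability to tell two colliding terms apart.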
One thing to note: MLlib uses a data type called LabeledPoint for supervised learning tasks. A LabeledPoint pairs a feature vector (in this case, our hashed text features) with a category label. So as we're extracting our features, we're going to map our RDD of input data into a new RDD of LabeledPoints.
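Conceptually, a LabeledPoint is just a labeled feature vector; a plain namedtuple illustrates the shape (this is a sketch of the idea, not MLlib's class, which also supports sparse vectors):

```python
from collections import namedtuple

# Stand-in for the idea behind LabeledPoint: a label plus a feature vector
Point = namedtuple("Point", ["label", "features"])

p = Point(label=3.0, features=[0.0, 2.0, 1.0])
```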
Finally, this RDD is a bit of a special case, since we'll soon split it into multiple RDDs. So we're going to store the RDD in memory so that Spark won't have to recompute it from scratch each time it's used downstream.
ACTION: Map our tokenized RDD to a new RDD, in which each item is a LabeledPoint with a category label and a vector of hashed features, then ask Spark to persist our new RDD in memory for later reuse:
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

# Hashing term frequency vectorizer with 50,000 features
htf = HashingTF(50000)

# Create an RDD of LabeledPoints, using category labels as labels
# and tokenized, hashed text as feature vectors
data_hashed = data_cleaned.map(lambda (label, text): LabeledPoint(label, htf.transform(text)))

# Ask Spark to persist the RDD so it won't have to be re-created later
data_hashed.persist()
Now it's time to split our data into two sets: one for training the model, and a smaller, non-overlapping data set for testing how well the model works on new data.
To accomplish this, we'll use randomSplit, which, as the name suggests, randomly splits an RDD into new RDDs using proportions we provide.
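The idea behind a weighted random split can be sketched in plain Python (a simplified illustration, not Spark's implementation): each item is independently assigned to a bucket with probability proportional to the weights, so the resulting sizes are approximate rather than exact.

```python
import random

def random_split(items, weights, seed=42):
    """Randomly partition items into len(weights) buckets, proportionally."""
    rng = random.Random(seed)
    total = float(sum(weights))
    cumulative = []
    acc = 0.0
    for w in weights:
        acc += w / total
        cumulative.append(acc)
    cumulative[-1] = 1.0  # guard against float rounding losing items
    buckets = [[] for _ in weights]
    for item in items:
        r = rng.random()
        for i, bound in enumerate(cumulative):
            if r <= bound:
                buckets[i].append(item)
                break
    return buckets

train, test = random_split(range(1000), [0.7, 0.3])
```

Because the assignment is per-item, a 70/30 split of 1,000 items yields roughly, not exactly, 700 and 300.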
ACTION: Split the RDD of transformed input data into two RDDs: 70% for training and 30% for testing:
# Split data 70/30 into training and test data sets
train_hashed, test_hashed = data_hashed.randomSplit([0.7, 0.3])
Now our training data is ready to be fed into a machine learning model. MLlib supports two different approaches for multi-class classification (as opposed to binary classification): Naive Bayes and Decision Trees. We'll use Naive Bayes for this example, because it's effective and simple to apply to a problem of this type; multi-class Decision Trees are also supported, but they take more parameters to tune than Naive Bayes classifiers do.
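To give a sense of what training will do with our term-frequency vectors, here's a bare-bones multinomial Naive Bayes sketch (a simplified illustration, not MLlib's implementation): per class, it learns a log prior and smoothed log likelihoods for each feature, then scores a new vector by summing them.

```python
import math
from collections import defaultdict

def train_nb(points, num_features, smoothing=1.0):
    """Fit per-class log priors and smoothed per-feature log likelihoods."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0.0] * num_features)
    for label, features in points:
        class_counts[label] += 1
        for i, count in enumerate(features):
            feature_counts[label][i] += count
    total = sum(class_counts.values())
    model = {}
    for label, n in class_counts.items():
        prior = math.log(n / float(total))
        row_total = sum(feature_counts[label]) + smoothing * num_features
        likelihoods = [math.log((c + smoothing) / row_total)
                       for c in feature_counts[label]]
        model[label] = (prior, likelihoods)
    return model

def predict_nb(model, features):
    """Return the class with the highest log prior + log likelihood score."""
    def score(label):
        prior, likelihoods = model[label]
        return prior + sum(f * ll for f, ll in zip(features, likelihoods))
    return max(model, key=score)

# Toy data: class 0 favors feature 0, class 1 favors feature 1
points = [(0, [3, 0]), (0, [2, 1]), (1, [0, 3]), (1, [1, 2])]
model = train_nb(points, num_features=2)
```

The independence assumption that gives Naive Bayes its name, treating each term count as independent given the class, is what makes it so cheap to train on high-dimensional text features like ours.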
ACTION: Train a new Naive Bayes model on the RDD of LabeledPoints we just created:
from pyspark.mllib.classification import NaiveBayes

# Train a Naive Bayes model on the training data
model = NaiveBayes.train(train_hashed)
You've now created an RDD, run a bunch of data transformations, done some feature extraction, and trained a machine learning model. Next let's see how well the model works.