Recall that we set aside 30% of our total dataset for testing our model. Now we need to run the testing data, which has already been transformed and hashed, through the model we just fit to our training data to see how well it performs. We’ll use the built-in structure of LabeledPoint to access the “label” and “features” of each point.
ACTION: Create a new RDD with the predicted category, alongside the actual category for comparison:
# Compare predicted labels to actual labels
prediction_and_labels = test_hashed.map(lambda point: (model.predict(point.features), point.label))
Now we have a few thousand tuples, each one representing a single newsgroup posting and pairing the category that our model predicted from the input data with the posting's actual category. Let's tally up how often our model's predictions were correct.
ACTION: Filter our RDD of labels and predictions to only the predictions that were correct, then count all the correct predictions and divide them by the total number of predictions. Finally, print a line of output with the accuracy rate:
# Filter to only correct predictions (each element is a (predicted, actual) tuple)
correct = prediction_and_labels.filter(lambda pair: pair[0] == pair[1])
# Calculate and print accuracy rate
accuracy = correct.count() / float(test_hashed.count())
print("Classifier correctly predicted category " + str(accuracy * 100) + " percent of the time")
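To check the accuracy arithmetic without a cluster, here is a minimal sketch of the same filter-and-count logic on a plain Python list. The helper name accuracy_rate and the sample tuples are hypothetical and are not part of the tutorial script. It assumes that each element is a (predicted, actual) pair of numeric labels.

```python
def accuracy_rate(prediction_and_labels):
    """Return the fraction of (predicted, actual) pairs that match."""
    pairs = list(prediction_and_labels)
    if not pairs:
        raise ValueError("no predictions to score")
    # Keep only the pairs where the prediction equals the actual label
    correct = [pair for pair in pairs if pair[0] == pair[1]]
    # float() guards against integer division on Python 2
    return len(correct) / float(len(pairs))


# Hypothetical sample data: three of the four predictions are correct
sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0)]
rate = accuracy_rate(sample)
print("Classifier correctly predicted category " + str(rate * 100) + " percent of the time")
```

With this sample the rate is 0.75. The Spark version performs the same division, using the counts of the filtered RDD and the full test set.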
It's time to run the whole script to see how well the model works.
ACTION: Save the Spark script and run it on Mortar:
mortar spark sparkscripts/text-classifier.py
If you already have a cluster running, it should take three or four minutes to run the entire sequence of steps. How well did your model perform? If everything worked properly, your classifier should have made accurate predictions about 80-85% of the time. Not too bad for a simple example! For more information about Spark and MLlib, check out our Spark resources page.