Now that we've tested our script on small data, let's run it against the full dataset of tweets. We'll use Hadoop's parallel processing to spread the work across a cluster of machines.
When running your code in the cloud, you need to decide how large a cluster of computers to use for your job. For this example, we only have about 0.5 GB of tweets in the cloud (not quite big data), so we don't need a very large cluster. We'll use a 3-node cluster, which should finish the job in under 5 minutes once it starts up. In later tutorials we'll cover more about how Hadoop works and how to determine the right cluster size for your data and your job.
By default this Mortar project uses AWS spot instance clusters to save money. Running this example on a 3-node spot instance cluster for 1 hour should cost you approximately $0.42 in pass-through AWS costs. Before running this job you will need to add your credit card to your account. You can do that on our Billing Page.
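To see where the ~$0.42 figure comes from, here's a back-of-the-envelope sketch. The per-node-hour rate below is an assumption chosen to match the figure above; actual AWS spot prices fluctuate over time and by region.

```python
# Rough pass-through cost estimate for a spot-instance cluster.
# The rate of $0.14 per node-hour is an illustrative assumption,
# not an official AWS price.

def estimate_cost(nodes, hours, rate_per_node_hour=0.14):
    """Pass-through AWS cost: nodes x hours x spot rate."""
    return nodes * hours * rate_per_node_hour

print(f"${estimate_cost(3, 1):.2f}")  # 3-node cluster for 1 hour
```

Doubling the cluster size roughly doubles the cost but can shrink the runtime, so for short jobs like this one a small cluster is the cheaper choice.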
When you're ready, run the job on a 3-node cluster:

```
mortar run pigscripts/tweets_by_time_block.pig -f params/tweets-cloud.params --clustersize 3
```
After running this command you will see output similar to:
```
Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Or by running:

    mortar jobs:status some_job_id --poll
```
This tells you that your job has started successfully and gives you two common ways to monitor the progress of your job.
Your job goes through three main steps after you submit it:

1. Mortar takes a snapshot of your code and validates your script.
2. Mortar starts a Hadoop cluster of the size you requested (this is where most of the waiting happens).
3. Your job runs on the cluster.
Once your job has started running on the cluster, you can get real-time feedback about its progress from the Mortar web application. Open the job status link displayed when you started your job:
Job status can be viewed on the web at: https://app.mortardata.com/jobs/job_detail?job_id=some_job_id
At the top of the screen you can see the basic details of your job: its current progress, the time the job started, how long it's been running, and the number of nodes in the cluster.
Below that is the Visualization tab. This shows how Pig has compiled your pigscript into Hadoop Map/Reduce jobs. The blinking box shows you the current running Map/Reduce job, and clicking on any box will show you information and statistics for that specific job. There's a lot of information in this visualization, so take a minute to click around and see how things work.
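To make the visualization easier to read, it helps to know the shape of what Pig generates: a GROUP-and-COUNT pipeline like ours compiles down to a map phase that emits a key per record and a reduce phase that aggregates each key. Here's a minimal Python sketch of that shape; the tweet format and field names are made up for illustration and are not the actual script's schema.

```python
from collections import defaultdict

# Illustrative sketch of the Map/Reduce shape that a Pig
# "GROUP BY hour, then COUNT" pipeline compiles down to.
# The tweet records below are invented for the example.

def map_phase(tweets):
    """Map: emit (hour, 1) for each tweet's creation time."""
    for tweet in tweets:
        hour = tweet["created_at"].split(":")[0][-2:]  # e.g. "22"
        yield (hour, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: sum the emitted counts for each hour key."""
    counts = defaultdict(int)
    for hour, one in pairs:
        counts[hour] += one
    return dict(counts)

tweets = [
    {"created_at": "2012-07-04 22:15:01"},
    {"created_at": "2012-07-04 22:47:33"},
    {"created_at": "2012-07-05 08:02:10"},
]
print(reduce_phase(map_phase(tweets)))  # {'22': 2, '08': 1}
```

On a real cluster the map and reduce phases run in parallel across the nodes, which is what each box in the visualization represents.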
Click on the Details tab to see some information about your job. Mortar keeps track of every parameter value you used and links to an exact snapshot of the code used in Github. Once your job has finished this page will also display a link to your output data on S3.
The remaining tabs show more detailed log information from Pig and Hadoop. Take a minute to look at those tabs and familiarize yourself with what is being shown.
Once your job has finished, you can examine the results. In this case Pig has written the results to S3, so you'll need to download the output files. The easiest way to do that is via the Details tab on the job status page shown above.
Take a look at the results. This time we see more tweets in each hour block, since we've processed much more data:
```
00 - 01    10
01 - 02     6
02 - 03     2
03 - 04     4
04 - 05     6
05 - 06     7
06 - 07     6
07 - 08    10
08 - 09    15
09 - 10     6
10 - 11     4
11 - 12     2
12 - 13     3
13 - 14     7
14 - 15    12
15 - 16     7
16 - 17    12
17 - 18     8
18 - 19    12
19 - 20    14
20 - 21     8
21 - 22     7
22 - 23    16
23 - 00    10
```
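As a quick sanity check on the results above, you can parse the downloaded output and find the busiest hour block. This sketch assumes each output line has the simple "HH - HH count" format shown:

```python
# Parse the job output shown above and find the busiest hour block.
# Assumes each line has the form "HH - HH count".
output = """00 - 01 10
01 - 02 6
02 - 03 2
03 - 04 4
04 - 05 6
05 - 06 7
06 - 07 6
07 - 08 10
08 - 09 15
09 - 10 6
10 - 11 4
11 - 12 2
12 - 13 3
13 - 14 7
14 - 15 12
15 - 16 7
16 - 17 12
17 - 18 8
18 - 19 12
19 - 20 14
20 - 21 8
21 - 22 7
22 - 23 16
23 - 00 10"""

counts = {}
for line in output.splitlines():
    start, _, end, n = line.split()
    counts[f"{start} - {end}"] = int(n)

busiest = max(counts, key=counts.get)
print(busiest, counts[busiest])  # 22 - 23 16
```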
From the set of tweets we've collected, it appears that the night time is the right time for excitement!