Web Projects let you develop Pig, Python, and Jython scripts directly on the Mortar website without installing anything.
For development, Mortar provides a Web IDE with rich syntax highlighting and integrated development tools. Each Web Project has a single Pig, Python, and Jython script. Mortar stores these scripts for you, and auto-saves them as you edit.
When you're finished developing, you can run your Web Project on a private Hadoop cluster, either through the Mortar website or via the API.
If you don't already have a Mortar account, you can request a free trial here.
If you do have an account, log in at app.mortardata.com.
Click on the Web IDE link at the top of the page.
The Web IDE will create a new example project called "My Web Project: Olympic Data". This example project and the accompanying tutorial help you find the country that won the most medals at the Olympics in 2010.
The editor screen is divided into two areas: Pig and UDFs.
The Pig code drives execution. Pig is a data flow language that is similar to SQL, but it is executed in steps: joins, filters, aggregates, sorts, and so on.
Scroll through the Pig code. Each statement builds on the statement before it. If you are familiar with SQL, it should read easily: load the data from S3, apply a few transformations to the data (you'll learn how by following the tutorial), and store the results back to S3.
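To make the step-by-step idea concrete, here is a minimal Python sketch of the same kind of data flow. It is not the example project's Pig script: the rows, field layout, and variable names are made up for illustration, and the comments only name the Pig operator each step loosely resembles.

```python
# Made-up sample rows: (country, year, medals). Not the tutorial's real data.
rows = [
    ("Norway", 2010, 4),
    ("Canada", 2010, 5),
    ("Canada", 2010, 2),
    ("Norway", 2006, 9),
    ("Germany", 2010, 6),
]

# Step 1 (roughly a Pig FILTER): keep only the 2010 rows.
rows_2010 = [r for r in rows if r[1] == 2010]

# Step 2 (roughly GROUP followed by SUM): total medals per country.
totals = {}
for country, _year, medals in rows_2010:
    totals[country] = totals.get(country, 0) + medals

# Step 3 (roughly ORDER ... DESC): sort by medal total, highest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Step 4 (roughly LIMIT 1): take the top country.
top_country, top_medals = ranked[0]
print(top_country, top_medals)  # prints: Canada 7
```

Each step consumes the result of the step before it, which is the same shape the Pig script follows between loading from S3 and storing back to S3.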
For more information and tutorials on Pig, check out Pig Help and Resources.
Python user-defined functions (UDFs) are applied to rows or groups of rows by Pig. Python UDFs enable rich computation, such as linguistic analysis, sophisticated parsing, advanced math, and sound analysis.
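As a rough illustration, a row-level Python UDF might look like the sketch below. The function name count_words and its schema string are hypothetical. The sketch assumes the Pig runtime provides a pig_util module with an outputSchema decorator; when that module is not available, a stand-in decorator is defined so the code still runs on its own.

```python
try:
    # Assumption: the Pig runtime supplies pig_util with an outputSchema decorator.
    from pig_util import outputSchema
except ImportError:
    # Stand-in for running outside Pig: records the schema and returns the function unchanged.
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("word_count:int")
def count_words(text):
    """Return the number of whitespace-separated words in one row's text field."""
    if text is None:
        # Pass missing values through as null rather than raising.
        return None
    return len(text.split())


print(count_words("gold silver bronze"))  # prints: 3
```

Pig would call a function like this once per row, passing in the field value and receiving the result as a new field.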
For more information on Python in Mortar, see Python Help and Resources.
Jython UDFs are similar to their Python counterparts, except that they're executed in the same Java Runtime Environment as Pig and Hadoop. This makes streaming data to and from the UDF much quicker than with Python. The tradeoff is that Jython cannot use the C-based libraries (like numpy and scipy) that Python can use.
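In practice, the tradeoff means that numeric work in a Jython UDF has to rely on pure-Python code. The following sketch computes a mean and standard deviation with only the standard library instead of numpy. The function name mean_and_stddev is hypothetical, and the code sticks to syntax that should also be valid on Jython 2.x.

```python
import math


def mean_and_stddev(values):
    """Return (mean, population standard deviation) without any C-based library."""
    values = [float(v) for v in values]
    if not values:
        # No data: nothing to summarize.
        return None
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


print(mean_and_stddev([2, 4, 4, 4, 5, 5, 7, 9]))  # prints: (5.0, 2.0)
```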
For more information about Jython in Mortar, see Jython Help and Resources.
Click the Illustrate button at the top right.
Illustrate checks that your code will run and shows you a small sample of real data flowing through each step of your script (load, filter, etc.).
Illustrate takes about twenty seconds to do its sophisticated sampling. That is much faster than manually curating a subset of the data yourself, and much faster than running a Hadoop job to test your work.
Now let's run the Olympic Data project. Mortar offers two ways to run jobs: in local mode, or on an actual Hadoop cluster that Mortar manages for you.
Click the Run button.
Runs in local mode are executed on your dedicated Pig server, which makes local mode a fast way to run small jobs. Even better, it's completely free!
Ensure the Receive email when job finishes option is checked. This way Mortar will notify you by email when your job completes.
Before starting your job, let's take a look at the other way you can run your Mortar jobs.
Select "Start a New Cluster" in the "Hadoop Cluster" dropdown.
To run a job on a cluster, you'll first need to add a credit card on the Billing Page. Clusters are billed at exactly AWS cost, with no markup.
As you can see, there are a few more options to consider when running on a Hadoop Cluster.
Your first choice is how large a cluster to start. The cluster size you need depends a lot on your job and data. A good rule of thumb is to start with a small cluster, see how your job runs, and move to larger clusters as necessary.
The Spot Instance Clusters option determines whether Mortar launches your cluster on the AWS Spot market. Spot Instance Clusters are up to 70% cheaper than On-Demand Clusters. The tradeoff is that they take longer to launch and can (very rarely) disappear before you are finished using them. To learn more, check out the Spot Instance Cluster documentation.
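To see what "up to 70% cheaper" could mean for a bill, here is a small arithmetic sketch. The node count, hours, and hourly rate are hypothetical placeholders, not actual AWS or Mortar prices, and 70% is the best case stated above rather than a guaranteed discount.

```python
def cluster_cost(nodes, hours, hourly_rate, discount=0.0):
    """Cost of a cluster run, with an optional fractional discount (0.70 means 70% off)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return nodes * hours * hourly_rate * (1.0 - discount)


# Hypothetical job: 10 nodes for 2 hours at a placeholder rate of 0.50 per node-hour.
on_demand = cluster_cost(nodes=10, hours=2, hourly_rate=0.50)
best_case_spot = cluster_cost(nodes=10, hours=2, hourly_rate=0.50, discount=0.70)

print(round(on_demand, 2), round(best_case_spot, 2))  # prints: 10.0 3.0
```

The real saving depends on Spot market prices at launch time, so treat the second figure as a lower bound on cost rather than an estimate.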
The Keep cluster running after job finishes option lets you reuse your cluster for future jobs, which saves money and lets those jobs start quickly without waiting for a new cluster.
When you choose to keep a cluster running for multiple jobs, it's a good idea to shut it down manually when you are done with it. You can do this on the Clusters Page. If you forget, Mortar will automatically shut down the cluster after it has been idle for 1 hour.
Because the example project uses only a small amount of data, we'll save some money and run it in local mode.
Select "No Cluster (Local Mode)" in the "Hadoop Cluster" dropdown.
Click the Run button.
Once you submit your job, you'll see a notification that it has been submitted to run.
To monitor your job's progress, open up the Jobs Page.
Click on your job name to see details.
In addition to job progress, the Job Details page shows you the exact code you ran and logs from your job.
When results are ready, you'll also see links to download them from S3. Any runtime errors your job encounters will be displayed here too.
When your job finishes, click on the link under Results to view your output data.