SparkR quick start that works

If you’re following along with the SparkR Quick Start, you’ll notice that its instructions are out of step with more recent builds of Spark. Here are instructions that work for SparkR version 1.4.1 on Linux. YMMV on Spark 1.5.

Now that SparkR has been promoted to the core Spark library, it lives in Spark’s bin directory with the other executables. Let’s define SPARK_HOME as the root of your Spark installation. The only environment variable you need to set is JAVA_HOME.


Launch bin/sparkR under SPARK_HOME. If all goes well, you should see an R console start up followed by Java log messages (prefixed with a bunch of INFOs). Finally, you should see a happy welcome message:

Welcome to SparkR!
Spark context is available as sc, SQL context is available as sqlContext

First, take a look around, using the ls() function, and you’ll see exactly two objects, sc and sqlContext, both of which were mentioned in the welcome message.

Now, let’s read in the file and perform some trivial operations. The original documentation says to use the textFile function, but unfortunately that is no longer exported by default. No worries: R doesn’t strictly enforce package namespaces, so we can still reach the unexported function with the ::: operator, like this:

> text_file <- SparkR:::textFile(sc, '')

Again, you’ll see a number of log messages. Don’t worry: as long as you don’t see any ERRORs or FATALs, there’s no reason to panic. Java is known for being verbose, and this extends to log messages.
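A short aside on the ::: operator used in the textFile call above. It is plain base R, not anything Spark-specific, so you can try it without a cluster. The sketch below uses the stats package purely as a stand-in (my choice, not something from the Quick Start) and picks whichever internal object happens to sort first.

```r
# Sketch: what `pkg:::name` does, shown with the stats package as a stand-in.
ns <- asNamespace("stats")

# Names the package exports versus everything defined in its namespace.
exported <- getNamespaceExports(ns)
internal <- setdiff(ls(ns), exported)

# Pick the first unexported name; which one it is depends on your R version.
name <- internal[1]

# `stats:::<name>` performs this same lookup inside the namespace,
# bypassing the export list entirely.
obj <- get(name, envir = ns, inherits = FALSE)

cat("Unexported objects in stats:", length(internal), "\n")
cat("Example:", name, "\n")
```

The same mechanism is what lets SparkR:::textFile resolve even though textFile is missing from SparkR’s export list.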

> count(text_file)
... Java log messages ...
[1] 98

As the name implies, this function counts the number of elements in the RDD, which here means the number of lines in the file. We can confirm this with the wc command-line utility.

$ wc -l 

But what does this object look like? As with Hadoop, the data is distributed and abstracted away from the end user. This means that simple commands like head(text_file) won’t work the same way as with a native R object. Instead, you must use the provided grammar of Spark actions to interact with the RDD. Viewing a line of the text file thus requires the take action.

> take(text_file,1)
[1] "# Apache Spark"

The output indicates that the result is a list of length one. How do we view the whole file? It turns out that take is analogous to head, so getting the whole file means passing its length, which we obtained above.

> take(text_file, count(text_file))
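Since take hands back an R list rather than a character vector, it can help to see how that shape behaves. Here is a minimal base R sketch, assuming a hand-built list (taken is a hypothetical stand-in, not real Spark output) in place of the result of take:

```r
# Hypothetical stand-in for the list that take() returns; no Spark needed.
taken <- list(
  "# Apache Spark",
  "Spark is a fast and general cluster computing system for Big Data. It provides",
  "and Spark Streaming for stream processing."
)

# head() on a list keeps the list structure, much like take(text_file, 1).
first <- head(taken, 1)
str(first)

# Flatten to a character vector when you want ordinary R string handling.
lines <- unlist(taken)
str(lines)
```

Flattening with unlist is only sensible once the data is small enough to live in your local R session.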

Now on to some data processing. The first function introduced is a filter operation. Those following my book, Modeling Data With Functional Programming In R, will notice this is one of the canonical higher-order functions. Now you see why :) Notice that again we have to specify the function with the ::: notation. These functions should really be exported, and perhaps in 1.5 they are.

> linesWithSpark <- SparkR:::filterRDD(text_file, 
  function(line) grepl("Spark", line))
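If the higher-order function idea is new to you, the same predicate can be tried locally with base R’s Filter. This is only a sketch on a small in-memory vector (the lines vector and the has_spark name are made up for illustration), not the distributed version:

```r
# Local stand-in data; these lines are illustrative, not read from a file.
lines <- c(
  "# Apache Spark",
  "",
  "rich set of higher-level tools including Spark SQL for SQL and structured",
  "## Building"
)

# The predicate is the same shape as the one passed to filterRDD.
has_spark <- function(line) grepl("Spark", line)

# Filter applies the predicate to each element and keeps the matches.
lines_with_spark <- Filter(has_spark, lines)
print(lines_with_spark)

# Because grepl is vectorized, plain logical indexing gives the same result.
same_result <- lines[grepl("Spark", lines)]
```

Note that the base R call runs immediately and returns the matching strings, whereas the RDD version does not.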

If you’re eager to see the output of the filter operation, you’ll likely be disappointed. What you’ll see is something like

RRDD[6] at RDD at RRDD.scala:35

Feel free to toss around some bemused expletives. In the meantime, let me explain what’s going on. At this point, Spark has only set up the work to do but hasn’t actually executed it. This is lazy evaluation, conceptually similar to the computer science notion of a promise. To actually evaluate the job, you need to use the collect action. This produces output similar to the following.

> collect(linesWithSpark)
[1] "# Apache Spark"

[1] "Spark is a fast and general cluster computing system for Big Data. It provides"

[1] "rich set of higher-level tools including Spark SQL for SQL and structured"

[1] "and Spark Streaming for stream processing."
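The deferred execution described above has a loose analogue in base R itself, which may make the idea more concrete. The sketch below uses delayedAssign to create a promise; the data and variable names are made up for illustration, and this is only an analogy, not how Spark schedules work internally.

```r
# Local stand-in data, purely illustrative.
lines <- c("# Apache Spark", "## Building", "and Spark Streaming for stream processing.")

# Flag that flips when the filtering code actually runs.
evaluated <- FALSE

# Set up the work without running it, a bit like defining the filtered RDD.
delayedAssign("lines_with_spark", {
  evaluated <- TRUE
  Filter(function(line) grepl("Spark", line), lines)
})

ran_before_use <- evaluated   # the promise has not been forced yet

# Touching the value forces the promise, a bit like calling collect.
result <- lines_with_spark
ran_after_use <- evaluated

print(c(before = ran_before_use, after = ran_after_use))
print(result)
```

The filtering only happens at the moment the value is first needed, which is the same pattern as the filter and collect pair above.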

At this point, we’ve walked through half the quick start. You should have a basic understanding of the Spark action grammar as well as an understanding of how to process simple jobs in Spark. The next step is to run some other types of computations, including map and reduce jobs. We’ll cover that in a subsequent post.

