Alliterations aside, here is a preview of something I’ve been tinkering with. My goal is to run R code as a phase within a Riak map/reduce job. In a multicultural world filled with distinct languages, it should be obvious that one size does not fit all. Statistics is not Erlang’s strong suit: writing a sparse matrix class is bad enough, but imagine implementing regression or random matrix theory. For its part, and despite many honorable attempts, R isn’t great at distributed processing. So, waving the banner of bringing the processing to the data, why not use R to process portions of a map/reduce job?
This actually isn’t as hard as it sounds. Below are a few snippets of R code being run via an Erlang RPC, which means that R is available and running as an Erlang node!
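For contrast, the quick-and-dirty way to reach R from Erlang is to shell out to Rscript on every call. Here is a minimal sketch of that approach (the module name r_naive is made up, and this is emphatically not how rchimedes works); it assumes Rscript is on the PATH and simply hands back R’s printed output as a string.
<pre>%% Hypothetical sketch, not part of rchimedes: evaluate one R expression
%% per call by spawning a fresh Rscript process through the OS shell.
-module(r_naive).
-export([eval/1]).

eval(Expr) when is_list(Expr) ->
    %% Returns R's printed output verbatim, e.g. "[1] 16\n".
    os:cmd("Rscript -e '" ++ Expr ++ "'").</pre>
Something like r_naive:eval("mean(c(10,12,13,25,20))") works, but it pays R’s startup cost on every call and hands back text rather than Erlang terms. Keeping R alive as a proper Erlang node is what makes the RPC calls below both fast and pleasant to consume.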
First, we call the R function ‘mean’ to compute the arithmetic mean of a list of numbers (the equivalent of mean(c(10, 12, 13, 25, 20)) in R):
<pre>(test@localhost)57> rpc:call('rchimedes@localhost', rchimedes, eval, {mean, [[10,12,13,25,20]]}).
{ok,{16.0}}</pre>
Next, we’ll draw samples from the standard normal distribution. To me, calling rnorm is the ‘Hello, World’ of R.
<pre>(test@localhost)58> rpc:call('rchimedes@localhost', rchimedes, eval, {rnorm, [10]}).
{ok,{-1.3440940467953522,1.0346333094171907,
     -2.7704297093573698,0.32721935800723084,1.6406162089066918,
     -0.480623709693892,-1.4687159958435285,-0.4415948361775166,
     -1.2729869815762578,0.8369905573667532}}</pre>
Currently the syntax is structured to use atoms as function references (i.e. the function must already exist in R space) and binary strings as function definitions. Notice that the arguments passed to the function are sent in a list; this is standard Erlang and supports passing additional arguments to the remote function call. For example, let’s say we want to draw from a normal distribution with mean 5:
<pre>(test@localhost)60> rpc:call('rchimedes@localhost', rchimedes, eval, {rnorm, [10,5]}).
{ok,{4.939374253203547,5.2481766179207545,6.413720221228998,
     5.679098487985773,6.371656468561924,5.572533109697437,
     4.196247547549403,5.36443397342678,3.7423040151803044,
     6.979719956460093}}</pre>
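The binary-string form isn’t shown in the session above, but a call of that flavor would look roughly like this (a hypothetical continuation of the shell session; the R snippet inside the binary is just an example): instead of naming an existing R function, the binary carries a function definition that is evaluated on the R side and then applied to the argument list.
<pre>(test@localhost)61> rpc:call('rchimedes@localhost', rchimedes, eval, {<<"function(n, m) rnorm(n, mean = m, sd = 2)">>, [10, 5]}).</pre>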
The above examples hopefully whet your appetite for what is possible here. The next step in the exercise is to execute these calls from within a Riak job and pull everything together end to end. Any ideas for case studies are welcome; otherwise, brace yourself for something finance-related.
This is super exciting. I’ve been thinking about combining Erlang and R for quite a while, but have not gotten around to actually doing it. Instead of fiddling around with Rmpi et al., it is much more natural to combine R with Erlang! Let us know your results and post some more code as time goes by. Thanks!
Thanks for the encouragement. I have to work through the details of riak_pipe, as the Basho folks recommended it as the best route for integration. Once Riak 1.0 is released, I imagine that exercise will be simpler.
It would also be interesting to compare your approach (R+Riak+Erlang) with what is already out there, like hive (https://r-forge.r-project.org/R/?group_id=409), the R interface to Hadoop Streaming (https://r-forge.r-project.org/R/?group_id=387), and RHIPE (http://ml.stat.purdue.edu/rhipe/). This is not meant to start a flame war. It’s just that all of these approaches are relatively new, and I think it would be extremely useful to compare them just to get an overview.
Also, since you’ve mentioned suggestions for a case study in your article: this is probably asking too much, but it would be interesting to see a text mining application. Something along the lines of tm.plugin.dc (http://cran.r-project.org/web/packages/tm.plugin.dc/).
Good idea. I’m working on more of a generic server structure to support both Riak integration via an R-based Erlang node and an AMQP-based listener for more stream-oriented processing. With Riak Pipe, stream-based processing should also be a possibility, although the details of the event loop would be different.
Regarding suggestions, I’ll take a look. I’ve been reading up on word collocations, so maybe there is a connection there.