R is still one of the best languages for working with data, thanks to its large collection of built-in analysis and plotting functions. I personally enjoy using it to extract insights from CSV/Excel files. After reading files into R's data frame type, you can do all sorts of SQL-like operations on them, including joining CSV files the way you would join tables. (I will get to that in later posts, but let's focus on MongoDB first.)
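Just to give a taste of what "SQL-like" means here, below is a tiny sketch. The two data frames and their column names are made up for illustration; in practice they would come from read.csv().

```r
# Two small, made-up data frames standing in for read.csv() results
users <- data.frame(id = c(1, 2, 3),
                    name = c("ann", "bob", "cat"),
                    stringsAsFactors = FALSE)
orders <- data.frame(user_id = c(1, 1, 3),
                     amount = c(10, 25, 7))

# Roughly: SELECT * FROM users JOIN orders ON users.id = orders.user_id
joined <- merge(users, orders, by.x = "id", by.y = "user_id")

# Roughly: SELECT name, SUM(amount) FROM joined GROUP BY name
totals <- aggregate(amount ~ name, data = joined, FUN = sum)
print(totals)
```

merge() does an inner join by default, so users without any orders drop out of the result.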

So, let's say that you have a MongoDB deployment with several shards. When you wish to do some data analysis, you have to log into your MongoDB router, run 'mongo --port 27018 your_database' to get into the MongoDB console, and start writing your MongoDB queries.

This works fine. However, the MongoDB console does not have any plotting features. Nor does it have an easy way to extract just one or a few columns of data and analyze them. Sure, you can use 'mongoexport --csv' to export the data as CSV and then switch to another platform or language, such as R or Python, for the analysis.

This sucks.

In fact, R has a very cool thing called RStudio Server. With it, you can turn your MongoDB router into an R web application and do your data analysis right on top of it.

[Figure: RStudio Server screenshot]


ssh into your MongoDB router, and download RStudio Server.

Once the download and installation are done, RStudio Server should be running on port 8787.

Tunnel into your machine like this:
ssh -L8787:localhost:8787 -i my_key root@my_ec2_machine.compute-1.amazonaws.com

You should see the login page when you visit http://localhost:8787 in your browser.

You must log in with your machine credentials. You can also create a user just for RStudio with the Linux useradd command.

Now, the only thing left on the environment side is the driver that lets R connect to your MongoDB. I personally like rmongodb. You should be able to download and install it directly from the RStudio console with a single R command.
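A minimal sketch of that install step, assuming rmongodb is still available from your CRAN mirror (if it is not, you would need to get it from another source). The ensure_package helper is a hypothetical name of mine, not part of any library.

```r
# Hypothetical helper: install a package only if it is not there yet,
# then report whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# In the RStudio console you would then run:
#   ensure_package("rmongodb")
#   library(rmongodb)
```

Wrapping the install in a check means you can keep the call at the top of a script without reinstalling the package on every run.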

Writing Up Your MongoDB Scripts

You are all set. It's time to write some convenience functions that turn MongoDB queries into an R data.frame, so you can do all sorts of sweet things to the data.

Here comes my example. The code should be very straightforward. You are welcome to modify it to fit your case.
## The collection structure looks like this:
##    "col1": "some value 1",
##    "col2": 1234,
##    "col3": "somevalue 3",
##    "col4": "some value 4",
##    "col5": "some value 5",
##    "col6": [
##        {
##            "some": 1,
##            "other": "other value 1"
##        },
##        {
##            "some": 0,
##            "other": "other value "
##        }
##    ],
##    "col6_1": 456,
##    "col6_2": 6789,
##    "col7": "some value 7"

## Here comes the way to conduct your query:
## my_data_frame <- queryIntoDataFrame('col2', 1234)

library(rmongodb)

# Count the documents whose column_name field equals column_value.
getCountsByMatch <- function(column_name, column_value){
  mongo <- mongo.create(host='localhost:27018' , db='my_mongodb')
  buf <- mongo.bson.buffer.create()
  mongo.bson.buffer.append(buf, column_name, column_value)
  query <- mongo.bson.from.buffer(buf)
  count <- mongo.count(mongo, "my_mongodb.my_mongodb_collection", query)
  if (count < 0) {
    # a negative count signals an error; report it and fall back to 0
    print(paste('mongo.get.err code: ', mongo.get.err(mongo)))
    count <- 0
  }
  # close the connection before returning
  mongo.destroy(mongo)
  return(count)
}

queryIntoDataFrame <- function(column_name, column_value){
  total_count <- getCountsByMatch(column_name, column_value)

  # set up some vectors to hold our results
  col1 <- vector("character",total_count)
  col2 <- vector("numeric",total_count)
  col3 <- vector("character",total_count)
  col4 <- vector("character",total_count)
  col5 <- vector("character",total_count)
  col6_1 <- vector("numeric",total_count)
  col6_2 <- vector("numeric",total_count)
  col7 <- vector("character",total_count)

  mongo <- mongo.create(host='localhost:27018' , db='my_mongodb')
  buf <- mongo.bson.buffer.create()
  mongo.bson.buffer.append(buf, column_name, column_value)
  query <- mongo.bson.from.buffer(buf)
  cursor <- mongo.find(mongo,"my_mongodb.my_mongodb_collection",query)

  i <- 1
  while (mongo.cursor.next(cursor)) {
    cval <- mongo.cursor.value(cursor)

    if (is.null(mongo.bson.value(cval,"col1"))) col1[i] <- ''
    else col1[i] <- mongo.bson.value(cval,"col1")

    # col2 is numeric, so use NA (not '') for a missing value;
    # assigning '' would turn the whole vector into characters
    if (is.null(mongo.bson.value(cval,"col2"))) col2[i] <- NA
    else col2[i] <- mongo.bson.value(cval,"col2")

    if (is.null(mongo.bson.value(cval,"col3"))) col3[i] <- ''
    else col3[i] <- mongo.bson.value(cval,"col3")

    if (is.null(mongo.bson.value(cval,"col4"))) col4[i] <- ''
    else col4[i] <- mongo.bson.value(cval,"col4")

    if (is.null(mongo.bson.value(cval,"col5"))) col5[i] <- ''
    else col5[i] <- mongo.bson.value(cval,"col5")

    # below is an example of extracting data from a bson list, based on the value stored in the list.
    col6 <- mongo.bson.value(cval,"col6")
    if (is.null(col6)) {
      # pick a default -1 when col6 is missing
      col6_1[i] <- -1
      col6_2[i] <- -1
    } else {
      # pick a default -2 when col6 exists but no item matches, just to show the difference
      col6_1[i] <- -2
      col6_2[i] <- -2
      for (item in col6) {
        # populate col6_1 and col6_2 based on item$some's value
        # (item$other is assumed to be numeric here)
        if (item$some == 0){
          col6_1[i] <- item$other
        } else if (item$some == 1) {
          col6_2[i] <- item$other
        }
      }
    }

    if (is.null(mongo.bson.value(cval,"col7"))) col7[i] <- ''
    else col7[i] <- mongo.bson.value(cval,"col7")

    i <- i + 1
  }

  # release the cursor and the connection
  mongo.cursor.destroy(cursor)
  mongo.destroy(mongo)

  df <- data.frame(col1=col1,col2=col2,col3=col3,col4=col4,col5=col5,col6_1=col6_1,col6_2=col6_2,col7=col7,stringsAsFactors=FALSE)
  # return the data frame
  return(df)
}

You can also find the gist at: https://gist.github.com/wingchen/8621215

Voila! You get your MongoDB data into an ordinary data frame. Let's enjoy playing with the data in data frames.

You only need to write the function once, according to your needs, and then you can enjoy all the statistical and plotting functions R has to offer.

At this point, you have a web RStudio (RStudio Server) sitting on top of your MongoDB. You can use your browser to script, plot, and program against your MongoDB data with ease.
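As a rough illustration of what comes next, here is a sketch that summarizes and plots such a data frame. The values are made up; only the column names follow the example above.

```r
# Made-up data frame with some of the columns queryIntoDataFrame() returns
df <- data.frame(
  col1 = c("a", "b", "a", "c"),
  col2 = c(1234, 1234, 1234, 1234),
  col6_1 = c(10, -1, 30, 20),
  col6_2 = c(5, -1, 15, -2),
  stringsAsFactors = FALSE
)

# Drop rows that still carry the -1 / -2 placeholder defaults
valid <- df[df$col6_1 >= 0 & df$col6_2 >= 0, ]

# Basic statistics on one column
print(summary(valid$col6_1))

# Mean of col6_1 per col1 value, similar to a GROUP BY
mean_by_col1 <- aggregate(col6_1 ~ col1, data = valid, FUN = mean)

# The plot shows up in the Plots pane of RStudio Server
barplot(mean_by_col1$col6_1, names.arg = mean_by_col1$col1,
        ylab = "mean col6_1")
```

Filtering out the placeholder values first matters, otherwise the -1 and -2 defaults would quietly drag the means down.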

Personally, I prefer this over the original MongoDB console.


The stack above is different from Pig/Hive with Hue, and it is not a replacement for it. The RStudio Server + MongoDB stack is a web data science stack on MongoDB, whereas Pig/Hive + Hue is a BIG data science stack. To turn RStudio into a BIG data one, we would still have to hook up Hadoop or Spark. But that is a different topic.

The downside of the stack above is that RStudio (like R) still loads EVERYTHING into its memory. So, please be careful with your queries. Don't do anything that will hang your machine forever.
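One way to stay on the safe side is to cap the number of documents and fetch only the fields you need, before anything reaches R. The sketch below assumes that rmongodb's mongo.find accepts fields and limit arguments (as far as I recall it does, but please check ?mongo.find). findLimited and its max_rows parameter are hypothetical names of mine.

```r
# Hypothetical helper: return a cursor over at most max_rows documents,
# carrying only the fields listed in field_names.
findLimited <- function(mongo, ns, query, field_names, max_rows = 10000L) {
  # Build a projection document such as {"col1": 1, "col2": 1}
  buf <- rmongodb::mongo.bson.buffer.create()
  for (name in field_names) {
    rmongodb::mongo.bson.buffer.append(buf, name, 1L)
  }
  fields <- rmongodb::mongo.bson.from.buffer(buf)

  # Assumption: fields and limit are named arguments of mongo.find
  rmongodb::mongo.find(mongo, ns, query,
                       fields = fields,
                       limit = as.integer(max_rows))
}

# Possible usage inside queryIntoDataFrame():
#   cursor <- findLimited(mongo, "my_mongodb.my_mongodb_collection",
#                         query, c("col1", "col2"), max_rows = 5000L)
```

If you go this way, size the result vectors with min(total_count, max_rows) so they match what the cursor actually returns.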

If you want this stack on your production MongoDB, make sure you don't put it on your main router. Put it on a backup one, so that operational traffic does not go through the same router where you run your memory- and CPU-heavy data analysis.

Also, if you have any thoughts that may improve the stack in any way, please let me know. Thanks.