
Introducing spark-cloudant, an open source Spark connector for Cloudant data
March 18, 2016

We would like to introduce you to the spark-cloudant connector, which lets you use Spark to run advanced analytics on your Cloudant data. The spark-cloudant connector can be found on GitHub or the Spark Packages site and is available for all to use under the Apache 2.0 License. As with most things Spark, it’s available for Python and Scala applications.

If you haven’t heard of Apache Spark™, it is the new cool kid on the block in the analytics space. Spark is touted as being an order of magnitude faster and much easier to use than its analytic predecessors, and its popularity has skyrocketed in the past couple of years. If you would like to learn more about Spark in general, I recommend checking out the Spark Fundamentals classes on Big Data University and the great tutorials on IBM developerWorks.

Start fast with Spark on Bluemix

So how do you get going quickly in analyzing your Cloudant data in Spark? Luckily, IBM has a fully managed Spark-aaS offering in IBM Bluemix that has the latest version of the spark-cloudant connector already loaded for you. Head on over to the Bluemix catalog to sign up and create a Spark instance to get started. Since the spark-cloudant connector is open source, you are also free to use it in your own stand-alone Spark deployments with Cloudant or Apache CouchDB™. Next, check out the README on GitHub, the Bluemix docs on Spark-aaS, and the great video tutorials on the Learning Center showing how to use the connector in both a Scala and Python notebook.

The integration with Spark opens the door to a number of new analytical use cases for Cloudant data. You can load whole databases into a Spark cluster for analysis. Alternatively, you can read from a Cloudant secondary index (a.k.a. “MapReduce view”) to pull a filtered subset or cleansed version of your Cloudant JSON. Once you have the data in Spark, use SparkSQL for full ad hoc querying capabilities in familiar SQL syntax. Spark can efficiently transform or filter your data and write it back into Cloudant or another data source. Because Spark has a variety of connection capabilities, you can also use it to conduct federated analytics over disparate data sources such as Cloudant, dashDB and Object Storage.
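As a rough illustration of loading a database, reading from a view, and writing results back, here is a minimal Python sketch. It assumes the data source name `com.cloudant.spark` and the option keys `cloudant.host`, `cloudant.username`, `cloudant.password` and `view` as described in the connector's README at the time of writing; check the README for the version you run. The helper names (`cloudant_options`, `load_cloudant`, `save_cloudant`) are hypothetical and not part of the connector.

```python
# Data source name registered by the spark-cloudant connector
# (assumption: taken from the connector's README; verify for your version).
CLOUDANT_FORMAT = "com.cloudant.spark"


def cloudant_options(host, username, password, view=None):
    """Build the connection options for one Cloudant account.

    `view` is an optional path to a MapReduce view; when it is set, the
    connector is expected to read from the view instead of the whole database.
    """
    options = {
        "cloudant.host": host,
        "cloudant.username": username,
        "cloudant.password": password,
    }
    if view is not None:
        options["view"] = view
    return options


def load_cloudant(sql_context, database, options):
    """Load a Cloudant database (or view) into a Spark DataFrame."""
    reader = sql_context.read.format(CLOUDANT_FORMAT)
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(database)


def save_cloudant(data_frame, database, options):
    """Write the rows of a DataFrame as documents into a Cloudant database."""
    writer = data_frame.write.format(CLOUDANT_FORMAT)
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.save(database)
```

In a notebook where `sqlContext` is already defined, a call such as `load_cloudant(sqlContext, "spark_sales", cloudant_options("ACCOUNT.cloudant.com", "USERNAME", "PASSWORD"))` would return a DataFrame; the account, user name and password here are placeholders. Keeping the options in one dictionary makes it easy to reuse the same account for both the read and the write.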

Example: Cloudant analytics with Spark

For another example of using the spark-cloudant connector, check out this example Python Notebook on GitHub and load it into your Spark service running on Bluemix. (It becomes interactive once you upload it to a Spark notebook using the instructions below.) This notebook does the following:

  • Loads a Cloudant database spark_sales from Cloudant’s examples account containing documents with sales rep, month, and amount fields. (Feel free to replicate the https://examples.cloudant.com/spark_sales database into your own Cloudant account and update the connection details if you prefer.)
  • Detects and prints the schema found in the JSON documents.
  • Counts the number of documents in the database.
  • Prints out a subset of the data and shows how to print out a specific field in the data.
  • Uses SparkSQL to perform counts, sums, and order by value queries on the data.
  • Prints a graph of the monthly sales.
  • Filters the data based on a specific sales rep and month.
  • Counts and shows the filtered data.
  • Saves the filtered data as documents into a Cloudant database in your own account.
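
The middle steps of that list can be sketched in a few lines of PySpark. This is not the notebook's actual code, only an approximation under stated assumptions: the field names `rep`, `month` and `amount` are inferred from the description above, the table name `sales` and the function names are hypothetical, and `sales` is assumed to be a DataFrame already loaded from Cloudant.

```python
# Table name used for SparkSQL queries (hypothetical).
SALES_TABLE = "sales"

# Sum the sales per month and order by the summed value.
# Field names `month` and `amount` are assumptions about the documents.
MONTHLY_TOTALS_SQL = (
    "SELECT month, SUM(amount) AS total FROM sales "
    "GROUP BY month ORDER BY total DESC"
)


def explore_sales(sql_context, sales):
    """Print schema, count and a sample, then return the monthly totals."""
    sales.printSchema()            # schema detected from the JSON documents
    print(sales.count())           # number of documents loaded
    sales.select("month").show(5)  # one specific field from a few rows
    # Spark 1.x API; on Spark 2.x and later use createOrReplaceTempView.
    sales.registerTempTable(SALES_TABLE)
    return sql_context.sql(MONTHLY_TOTALS_SQL)


def filter_sales(sales, rep, month):
    """Keep only the rows for one sales rep in one month."""
    return sales.filter((sales["rep"] == rep) & (sales["month"] == month))
```

Calling `.show()` on either result prints it in the notebook, and the filtered DataFrame is what the final step would write back to a database in your own account.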

This article was originally published on ibm.com and can be viewed in full here

