Moving from Hadoop to Spark – challenges and workarounds

We are using an Oozie workflow on our production Hadoop cluster to manage a pipeline of 57 MapReduce (MR) jobs. This pipeline of MR jobs does the following:

  • Loads data from relational stores using Sqoop imports
  • Calculates aggregations across multiple dimensions for our OLAP offering (MR jobs written in Java)
  • Writes the end results back to the RDBMS.

To make maximum use of the processing resources on our cluster and minimize aggregation calculation times, we use Oozie forks and joins to run multiple MR jobs in parallel. These include the Sqoop imports that fetch the initial datasets before processing begins. After moving to Hadoop in production, we are able to calculate aggregations for our OLAP platform in a matter of minutes (down from a couple of hours with the traditional relational databases we used earlier).
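For readers who have not used Oozie forks and joins before, here is a trimmed-down sketch of the idea (the action names, tables, and paths are made up, and most of the surrounding workflow boilerplate is omitted), showing two Sqoop imports running in parallel before the aggregation jobs start:

    <!-- Fork: run both Sqoop imports in parallel (names and paths are hypothetical). -->
    <fork name="import-fork">
        <path start="import-orders"/>
        <path start="import-customers"/>
    </fork>

    <action name="import-orders">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${jdbcUrl} --table ORDERS --target-dir /data/orders -m 4</command>
        </sqoop>
        <ok to="import-join"/>
        <error to="fail"/>
    </action>

    <action name="import-customers">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${jdbcUrl} --table CUSTOMERS --target-dir /data/customers -m 4</command>
        </sqoop>
        <ok to="import-join"/>
        <error to="fail"/>
    </action>

    <!-- Join: acts as a barrier, so aggregation starts only after every import finishes. -->
    <join name="import-join" to="first-aggregation-job"/>

The join node is what lets us treat the imports as a single step: the workflow only moves on to the aggregation jobs once every forked path has completed.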

With the release of Apache Spark, we want to run our aggregation logic on a Spark cluster. Spark processes data in memory, whereas Hadoop MR jobs persist data back to disk after both the map and reduce phases. Writing the output of the map and reduce phases to disk becomes a big drag on performance when you have a pipeline of MR jobs. Even for problems that are not memory-intensive, the Spark engine can generate a general execution/operator graph from the code in a Spark program and apply optimizations across it. We want to try Spark to see how the sum of these optimizations across a big pipeline of 57 jobs translates into better performance. Spark's support for operator chaining will also come in handy when pre-processing data (for example, applying a filter function). After a careful look at our current Oozie workflow (MR pipeline), we determined that once we load data into HDFS (using Sqoop imports), we could use Spark to orchestrate our aggregation calculation logic.
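To make the operator-chaining point concrete, here is a minimal sketch in Spark's Java API (the HDFS path, CSV layout, and class name are made up for illustration) in which a parse step, a filter, and an aggregation are chained together and planned as a single job:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SalesAggregation {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SalesAggregation");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Dataset previously imported into HDFS by Sqoop (path and column layout are hypothetical).
            JavaRDD<String> lines = sc.textFile("hdfs:///data/orders/part-*");

            // Parse, filter and aggregate in one chained pipeline; Spark plans these
            // transformations together instead of running separate jobs with
            // intermediate HDFS writes between them.
            JavaPairRDD<String, Double> totalsByRegion = lines
                    .map(line -> line.split(","))
                    .filter(fields -> fields.length == 3 && !fields[2].isEmpty())
                    .mapToPair(fields -> new Tuple2<>(fields[0], Double.parseDouble(fields[2])))
                    .reduceByKey((a, b) -> a + b);

            totalsByRegion.saveAsTextFile("hdfs:///output/totals-by-region");
            sc.stop();
        }
    }

In the MR pipeline, the filter and the aggregation would typically be separate jobs with an HDFS write in between; here they are just chained transformations on the same RDD.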

Because Spark is still in its early stages, it does not yet have support for Sqoop (to import data from relational persistent stores). One of the requirements in our use case is to import initial data from various relational databases (supported by our product) into HDFS. A shortcoming of using Spark (with Java) is that there is no easy or efficient way to load datasets from a relational store directly into Spark. Although early versions of Spark do support the concept of JDBC RDDs, that is not the most efficient way of loading data. With the release of Spark 1.3, the team is pretty excited about the new Spark SQL data source that can read data from other databases into DataFrames using JDBC. Once the data is read into a DataFrame, you can easily convert it into RDDs and use them within your Spark code.
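As an illustration of what drew us to this feature, here is a minimal sketch of the Java side of it (the connection URL, table name, and credentials are placeholders, and it uses the load(...) form of the API as it exists in Spark 1.3):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;

    public class JdbcImportExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("JdbcImportExample");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            // Connection details are placeholders; the JDBC driver jar must be on the Spark classpath.
            Map<String, String> options = new HashMap<>();
            options.put("url", "jdbc:mysql://dbhost:3306/sales?user=reporting&password=secret");
            options.put("dbtable", "orders");

            // Read the table through the Spark SQL JDBC data source into a DataFrame.
            DataFrame orders = sqlContext.load("jdbc", options);

            // Convert to an RDD of Rows for use in the existing aggregation code.
            JavaRDD<Row> orderRows = orders.toJavaRDD();
            System.out.println("Imported " + orderRows.count() + " rows");

            sc.stop();
        }
    }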

After testing this new feature extensively on our small test clusters, we found that it is still in an experimental stage and does not work for all databases. Although the feature worked well for all the cases we tested against MySQL and PostgreSQL, it did not work at all for 90% of the tables in MS SQL Server. Because of the problems we ran into while connecting to tables in MS SQL Server and Oracle, we decided not to use this particular feature until it becomes more stable in future releases.

To get past this issue and still be able to build a PoC for solving our aggregation problem with Apache Spark, we modified our existing Oozie workflow. In the new workflow, we first load all the required data into HDFS using Sqoop import actions run in parallel with forks and joins. Once all the data has been loaded, we run a Spark job (containing our aggregation logic) as a shell script from the Oozie workflow. Based on the initial results (on a small data set) in our test cluster, we have seen considerable performance improvement from running our aggregation logic on Spark.
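For reference, the Spark step in the modified workflow is just an Oozie shell action along these lines (the action name, script name, and transitions are placeholders); the script itself is a thin wrapper around a spark-submit command:

    <!-- Shell action that launches the Spark aggregation job once every
         Sqoop import has completed (names and transitions are hypothetical). -->
    <action name="run-spark-aggregation">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- submit_aggregation.sh wraps a spark-submit call, e.g.
                 spark-submit --class com.example.SalesAggregation --master yarn-cluster aggregation.jar -->
            <exec>submit_aggregation.sh</exec>
            <file>${workflowAppUri}/submit_aggregation.sh#submit_aggregation.sh</file>
        </shell>
        <ok to="export-results"/>
        <error to="fail"/>
    </action>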

Although learning Spark has been pretty interesting, it does present some challenges in terms of debugging/troubleshooting problems as well as modeling a solution using the current API exposed by the Spark engine.

As we move towards making the transition from Hadoop to Spark, I will be evaluating workflow engines like Pinball and Spark Job Server that work well with Spark, and will share my experiences in a future blog post.
