Running Traditional Applications on Hadoop

I work on cloud-based SaaS business analytics software that transforms disparate, multi-dimensional enterprise data into actionable insights through user-friendly infographics, charts, and dashboards. The amount of data generated across the enterprise has been growing exponentially, and we've re-engineered our product to run on Big Data technologies (the Hadoop ecosystem) for scale and performance.

We are transitioning from an RDBMS-based Java application that aggregates data from disparate sources across multiple dimensions to an application containing dozens of MapReduce (MR) jobs, managed via Oozie workflows, that perform joining, filtering and aggregation. The Oozie workflows use a combination of MR jobs, Sqoop actions, Java actions, decision controls, forks and joins, and EL functions. This presents a unique set of challenges as we move our core processing from relational databases (where we made heavy use of temporary tables) to an approach where the bulk of the processing is done in parallel using the MapReduce programming model, with the end results pushed back to an RDBMS that the web interface then uses to display infographics.

In this post, I will share our experiences and the lessons learnt during the transformational journey of re-architecting an RDBMS-based application that needed to scale to process 1 TB to 10 TB of data every day.

  • Try to keep your MR jobs as simple as possible, but remember that every MR job writes its output to disk. If you run a series of MR jobs in a pipeline, combining multiple MR jobs into a single job, where possible, avoids intermediate writes and improves performance (one way to do this, using ChainMapper, is sketched after this list).
  • The MultipleOutputs class lets you write the output of an MR job to multiple directories. This becomes very useful when you want to segment your output into different directories based on some business logic (a sketch appears after this list).
  • Use Sqoop direct table import to transfer data from your relational store into HDFS. Once the data is imported into HDFS, use MR jobs (driven by an Oozie workflow) to process it. Sqoop uses Hadoop's parallelism to fetch different slices of the data in independent parallel tasks, using the table's primary key by default to slice the data. To import data from more than one table into HDFS, use Sqoop's free-form query option (both import styles are sketched after this list).
  • A note of caution regarding Sqoop free-form queries: use them carefully, since they can cause performance issues. In addition to being slower than a direct table import, a free-form query import requires you to supply the --split-by and --boundary-query parameters to slice the data, along with $CONDITIONS in the WHERE clause. If the values of the column(s) used to slice the data are not uniformly distributed, performance can suffer severely, with just a few tasks importing the bulk of the data.
  • Sqoop free-form query imports also do not work well with complex queries containing multiple joins or inner queries. A good workaround for such queries is to read the data from the individual tables using Sqoop and then use MR jobs to join, filter or aggregate it into the desired result (a minimal reduce-side join sketch appears after this list).
  • If your product uses multiple relational databases such as MySQL, Oracle and MS SQL Server, you will have to figure out the most efficient way to load the processed data from HDFS back into each of them.
    • Sqoop provides upsert functionality (the --update-mode parameter with the value allowinsert), which updates an already existing record or inserts a new record if none exists in the table.
    • However, this upsert functionality does not work for MS SQL Server.
    • We were able to work around this by using the MultipleOutputs class in our MR job to create separate output folders for records that need to be inserted and records that need to be updated in the RDBMS table. This allows us to use two separate Sqoop export jobs pointing at the different sets of files: one performing inserts and one performing updates (using the --update-key parameter to specify the columns that uniquely identify a row). The corresponding export commands are sketched after this list.
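
To make the first point concrete, here is a minimal sketch of collapsing what would otherwise be three jobs (filter, project, aggregate) into a single MR job using Hadoop's ChainMapper and ChainReducer. The class names, record layout and paths are made up for illustration, and this particular technique only applies when the intermediate steps are map-only; it is a sketch of the idea, not our production code.

```java
// Sketch: a filter step, a projection step and an aggregation combined into
// ONE MR job with ChainMapper/ChainReducer, instead of three jobs that each
// write intermediate output to HDFS. Input is assumed to be CSV lines of the
// form "dimension,region,measure" (a made-up layout for illustration).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedJobDriver {

  // Map stage 1: drop malformed records.
  public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 3) {
        ctx.write(new Text(fields[0]), value);   // key by the dimension
      }
    }
  }

  // Map stage 2: project down to (dimension, measure).
  public static class ProjectMapper extends Mapper<Text, Text, Text, LongWritable> {
    @Override
    protected void map(Text key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      ctx.write(key, new LongWritable(Long.parseLong(fields[2].trim())));
    }
  }

  // Reduce stage: aggregate the measure per dimension value.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter-project-aggregate");
    job.setJarByClass(CombinedJobDriver.class);

    // Chain the two map stages and the reduce stage into one job.
    ChainMapper.addMapper(job, FilterMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        new Configuration(false));
    ChainMapper.addMapper(job, ProjectMapper.class,
        Text.class, Text.class, Text.class, LongWritable.class,
        new Configuration(false));
    ChainReducer.setReducer(job, SumReducer.class,
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        new Configuration(false));

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```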
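
Here is a minimal sketch of the MultipleOutputs pattern, in the shape we ended up needing for the MS SQL Server workaround: a reducer that routes each record into an inserts/ or updates/ subdirectory of the job output path. The record format and the isExistingRow() check are hypothetical placeholders for whatever business logic decides the split.

```java
// Sketch: a reducer that splits its output into "inserts/" and "updates/"
// subdirectories of the job output path using MultipleOutputs.
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class UpsertSplitReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text record : values) {
      // Route each record based on whether the row already exists in the RDBMS.
      String baseOutputPath = isExistingRow(key) ? "updates/part" : "inserts/part";
      mos.write(NullWritable.get(), record, baseOutputPath);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();   // flush and close the extra outputs
  }

  // Hypothetical placeholder: in practice this could be a flag joined in
  // from the data already present in the target table.
  private boolean isExistingRow(Text key) {
    return key.toString().endsWith("#existing");
  }
}
```

Both subdirectories land under the job's normal FileOutputFormat output path, so the two Sqoop export jobs sketched at the end of this post can simply point at the inserts and updates directories.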
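
For reference, the two import styles look roughly like this on the command line. The connection strings, credentials, table and column names are made up for illustration, and the flags should be checked against the Sqoop version you are running.

```sh
# Table import: Sqoop slices the table on its primary key by default and
# fetches the slices in parallel map tasks.
# (Add --direct to use the database's native dump utility where supported.)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.dbpass \
  --table orders \
  --num-mappers 8 \
  --target-dir /data/raw/orders

# Free-form query import: you must include $CONDITIONS in the WHERE clause
# and tell Sqoop how to slice the data (--split-by, optionally --boundary-query).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.dbpass \
  --query 'SELECT o.id, o.customer_id, o.amount FROM orders o WHERE $CONDITIONS' \
  --split-by o.id \
  --boundary-query 'SELECT MIN(id), MAX(id) FROM orders' \
  --num-mappers 8 \
  --target-dir /data/raw/orders_query
```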
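
And a compact sketch of the join workaround: a reduce-side join of two Sqoop-imported tables (orders and customers, with made-up paths and column layouts) on customer_id. The mapper tags each record with its source table and keys it by the join column; the reducer stitches the two sides back together.

```java
// Sketch: a reduce-side join of two Sqoop-imported tables on customer_id.
// Both inputs are assumed to be plain CSV files under /data/raw/orders and
// /data/raw/customers (hypothetical paths and layouts).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderCustomerJoin {

  // Tags each record with its source table (derived from the input file path)
  // and emits it keyed by customer_id, the join column.
  public static class TaggingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String file = ((FileSplit) ctx.getInputSplit()).getPath().toString();
      String tag = file.contains("customers") ? "C" : "O";
      String[] f = value.toString().split(",");
      // customers: id,name,...      orders: id,customer_id,amount,...
      String joinKey = tag.equals("C") ? f[0] : f[1];
      ctx.write(new Text(joinKey), new Text(tag + "|" + value));
    }
  }

  // Joins the single customer record per key to each of its orders.
  // A production version would use a secondary sort so the customer record
  // arrives first, instead of buffering all orders in memory.
  public static class JoinReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String customer = null;
      List<String> orders = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("C|")) customer = s.substring(2);
        else orders.add(s.substring(2));
      }
      if (customer == null) return;   // inner join: drop unmatched orders
      for (String order : orders) {
        ctx.write(NullWritable.get(), new Text(order + "," + customer));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "order-customer-join");
    job.setJarByClass(OrderCustomerJoin.class);
    job.setMapperClass(TaggingMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/data/raw/orders"));
    FileInputFormat.addInputPath(job, new Path("/data/raw/customers"));
    FileOutputFormat.setOutputPath(job, new Path("/data/joined/orders_customers"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```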
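
Finally, the export side: the single-job upsert form, which works where the database connector supports --update-mode allowinsert, and the two-job insert/update split we fall back to for MS SQL Server. Again, connection strings, table names and the key column are illustrative only.

```sh
# Upsert in one job, where the connector supports it
# (update existing rows matched on customer_id, insert the rest):
sqoop export \
  --connect jdbc:mysql://dbhost/reporting \
  --username etl_user --password-file /user/etl/.dbpass \
  --table daily_summary \
  --export-dir /data/out/daily_summary \
  --update-key customer_id \
  --update-mode allowinsert

# MS SQL Server workaround: two jobs over the directories produced by the
# MultipleOutputs reducer sketched above.
# 1) Plain insert-mode export (Sqoop's default) for the brand-new rows:
sqoop export \
  --connect 'jdbc:sqlserver://dbhost;databaseName=reporting' \
  --username etl_user --password-file /user/etl/.dbpass \
  --table daily_summary \
  --export-dir /data/out/daily_summary/inserts

# 2) Update-only export for rows that already exist:
sqoop export \
  --connect 'jdbc:sqlserver://dbhost;databaseName=reporting' \
  --username etl_user --password-file /user/etl/.dbpass \
  --table daily_summary \
  --export-dir /data/out/daily_summary/updates \
  --update-key customer_id
```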
