Transition from a traditional architecture to real-time processing using Apache Spark

Many traditional applications fail to handle the exponential increase in the volume and velocity of business data generated today. In this post, I want to share my experiences from our journey of moving our commercial offering's core processing logic from a traditional Java RDBMS architecture to the MapReduce programming model (on Hadoop) and eventually to Apache Spark.

The figure below shows our traditional approach to aggregating data:

Fig1: Traditional Java/RDBMS aggregation architecture
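To make the starting point concrete, here is a minimal sketch of what this style of aggregation typically looks like: pull the work down into the database with a GROUP BY over JDBC. This is not our actual code; the connection URL, table, and column names (`transactions`, `account_id`, `amount`) are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RdbmsAggregation {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; the real system had its own schema and drivers.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/metrics", "user", "password")) {
            // Push the aggregation down to the database with a GROUP BY.
            String sql = "SELECT account_id, SUM(amount) AS total "
                       + "FROM transactions GROUP BY account_id";
            try (PreparedStatement stmt = conn.prepareStatement(sql);
                 ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s -> %.2f%n",
                            rs.getString("account_id"), rs.getDouble("total"));
                }
            }
        }
    }
}
```

This works well at modest scale, but a single database instance becomes the bottleneck once the data no longer fits comfortably on one machine.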

As the amount of data we aggregated grew from a few hundred gigabytes to a couple of terabytes, we started running into performance and scalability issues with our traditional Java RDBMS approach. We moved our core aggregation logic to Hadoop (orchestrated via an Oozie workflow), as shown in the figure below:

Fig2: Aggregation on Hadoop, orchestrated as an Oozie workflow
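The post does not include the actual job code, but the same key-based aggregation expressed as a MapReduce job looks roughly like the sketch below. The CSV input format and field positions are assumptions for illustration; in practice the Oozie workflow chained several jobs like this one.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregationJob {

    // Maps a CSV line "account_id,amount,..." to (account_id, amount).
    public static class AggMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    // Sums all amounts seen for a given account_id.
    public static class AggReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aggregation");
        job.setJarByClass(AggregationJob.class);
        job.setMapperClass(AggMapper.class);
        // Summation is associative, so the reducer can double as a combiner.
        job.setCombinerClass(AggReducer.class);
        job.setReducerClass(AggReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```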

Although Hadoop gave us a significant performance improvement over the traditional approach, processing still took quite some time because of heavy disk I/O and the generation of intermediate files at each step of the Oozie workflow. To run our aggregation logic in real time, we eventually moved it to Apache Spark, as described below:

Fig3: Aggregation on Apache Spark
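For comparison, a minimal Spark sketch of the same aggregation (again with a hypothetical input path and column names, not our production job) would look roughly like this. The key difference from the Oozie/MapReduce pipeline is that intermediate results stay in memory as datasets instead of being written to disk between workflow steps, which is what makes the multi-step aggregation so much faster.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class SparkAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("aggregation")
                .getOrCreate();

        // Hypothetical CSV input; the real pipeline reads its own sources.
        Dataset<Row> transactions = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(args[0]);

        // The same key-based aggregation, executed in memory across the cluster
        // without materializing intermediate files between steps.
        Dataset<Row> totals = transactions
                .groupBy(col("account_id"))
                .agg(sum(col("amount")).alias("total"));

        totals.write().mode("overwrite").parquet(args[1]);

        spark.stop();
    }
}
```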

 
