We have been running a Cloudera Hadoop Distribution (CDH) cluster in our data center. As we migrate more and more of our applications and products away from traditional architectures to Big Data, we have felt the need to use the cloud for our infrastructure needs instead of an in-house Hadoop cluster. This should bring down our infrastructure and operational costs significantly. To move from an on-premise CDH cluster to an Amazon Elastic MapReduce (EMR) cluster, we are presented with various options regarding how to set up the cluster:
- Continue using tools like Oozie to manage workflows for MapReduce jobs, or use Amazon-provided services like AWS Data Pipeline and Amazon SQS to orchestrate them.
- Use either S3 or HDFS to store data that is being processed.
- Use either a transient EMR cluster that terminates after processing, or a long-running EMR cluster (see the sketch after this list for what launching a transient cluster looks like).
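To make the transient-cluster option concrete, here is a minimal sketch using the AWS SDK for Java. The bucket names, jar path, instance types, and release label are placeholders, not our actual configuration. The key detail is `withKeepJobFlowAliveWhenNoSteps(false)`, which tells EMR to terminate the cluster once its steps finish; reading input from and writing output to S3 means the data outlives the cluster:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class TransientEmrLauncher {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // A single step that runs our job jar; input and output are placeholder S3 paths
        StepConfig step = new StepConfig()
                .withName("process-data")
                .withActionOnFailure("TERMINATE_CLUSTER")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://our-bucket/jobs/etl-job.jar")
                        .withArgs("s3://our-bucket/input/", "s3://our-bucket/output/"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("transient-etl-cluster")
                .withReleaseLabel("emr-5.30.0")            // placeholder EMR release
                .withLogUri("s3://our-bucket/emr-logs/")
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withSteps(step)
                .withInstances(new JobFlowInstancesConfig()
                        .withMasterInstanceType("m4.large")
                        .withSlaveInstanceType("m4.large")
                        .withInstanceCount(3)
                        // false => the cluster shuts down after the last step,
                        // which is what makes it "transient"
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Launched cluster: " + result.getJobFlowId());
    }
}
```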
We realize that the answers to the questions above depend on the use case(s) as well as the technology stack being used to solve the problem. We solve our problems by defining data application pipelines (workflows) that include multiple MapReduce jobs, Java actions, Sqoop jobs, Pig scripts, etc. In our case, we invoke these workflows either on a pre-defined schedule or in real time based on user actions, when certain conditions are met. On our on-premise CDH cluster, we use the Oozie coordinator system to define and execute recurrent workflows that run on a pre-defined schedule. For use cases that need real-time triggers, we use Oozie workflow jobs, which can be invoked dynamically from Java code, as shown in the sketch below.
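For the real-time trigger case, the dynamic invocation looks roughly like this sketch using Oozie's Java client API. The Oozie server URL, HDFS paths, and workflow properties are hypothetical placeholders; the workflow definition itself is assumed to already live at the application path:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class WorkflowTrigger {
    public static void main(String[] args) throws OozieClientException {
        // Placeholder Oozie server URL
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Placeholder HDFS paths and properties referenced by the workflow definition
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/our-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager-host:8032");
        conf.setProperty("inputDir", "/data/incoming");

        // Submit and start the workflow job; returns the Oozie job id
        String jobId = oozie.run(conf);
        System.out.println("Workflow started, job id: " + jobId);
    }
}
```

This is the same client API our application code calls when a user action satisfies the trigger conditions; the scheduled cases are handled by coordinator definitions instead.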
In my next blog (part 2), I will describe how we set up our AWS EMR cluster, which options we chose while configuring it, and the reasons for choosing those options.