Building a Scalable Data Processing Platform for Analytics – Part 2

In Part 1 of this four-part blog series, I discussed the key factors that influenced the design and architecture of our Data Processing Platform (DPP). In this post, I will talk about the tools and frameworks we used to build the DPP.

Initially, we used a small number of mission-critical use cases to design and build a platform that met those needs; this included sizing the cluster, estimating data ingestion loads (bandwidth and storage), and planning for future growth. Once we had the initial solution working, we went through multiple iterations and numerous other (next-generation) use cases to devise an abstract, widely applicable data processing architecture that was open, flexible, extensible (in size and function), and, very importantly, horizontally scalable. Horizontal scale may best be defined as “scalable through the addition of new nodes at a constant or diminishing cost.” We chose a hub-and-spoke approach, where we move, transform, and re-index data from a growing body of silos into a centralized system (our DPP) to support self-service discovery, reporting, and analytics.
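To make the hub-and-spoke idea concrete, here is a minimal Python sketch of the pattern: records are pulled from a few source silos (the “spokes”), normalized into a common shape, and landed in one central location (the “hub”). The silo paths, file format, and hub location are all hypothetical, and the real DPP does this with the ingestion tools described later in this post rather than a standalone script.

```python
# Illustrative hub-and-spoke ingestion sketch; the silo paths, record shape,
# and hub location are hypothetical, not our production layout.
import csv
import json
from pathlib import Path

SILOS = [Path("/data/silos/crm"), Path("/data/silos/billing")]  # hypothetical spokes
HUB = Path("/data/hub/landing")                                 # hypothetical hub


def ingest(silos, hub):
    """Pull CSV records from each silo, normalize them, and land them centrally."""
    hub.mkdir(parents=True, exist_ok=True)
    for silo in silos:
        out_path = hub / ("%s.jsonl" % silo.name)
        with out_path.open("w") as sink:
            for csv_file in sorted(silo.glob("*.csv")):
                with csv_file.open() as source:
                    for row in csv.DictReader(source):
                        # Tag every record with its source so downstream jobs
                        # can filter, re-index, and report per silo.
                        row["_source"] = silo.name
                        sink.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    ingest(SILOS, HUB)
```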

Rather than restricting ourselves to any specific system or tools, we use an ever-evolving collection of tools. The core hot (“readily available”) storage layer for our DPP is Amazon S3 and the Hadoop Distributed File System (HDFS), a distributed file storage system with MapReduce as its processing engine. MapReduce lets us break big jobs into smaller ones: it handles all of the complexities of splitting a big request into small requests and re-aggregating their results back into the answer to the original, big request. To manage the size of our ever-growing cluster, we use Amazon Glacier as a cold (think “available, but not immediately necessary”) storage layer for older data that loses relevance with time.
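To show what “breaking a big job into smaller ones” looks like in practice, here is a minimal word-count sketch in the Hadoop Streaming style, where the same script runs as the mapper on each input split and as the reducer over the sorted intermediate pairs. This is illustrative only; the word-count job and its file layout are my example, not one of our DPP jobs.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style word count (illustrative sketch only).
# The same script acts as mapper or reducer depending on the first argument;
# Hadoop Streaming wires stdin/stdout between the split (map) phase and the
# re-aggregation (reduce) phase.
import sys
from itertools import groupby


def mapper():
    # Map phase: each worker sees only its slice of the input and emits
    # intermediate (word, 1) pairs, one per line.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())


def reducer():
    # Reduce phase: the framework delivers intermediate pairs sorted by key,
    # so counts per word can be aggregated with a simple group-by.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(count) for _, count in group)))


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

Hadoop Streaming would typically launch a script like this with its -input, -output, -mapper, and -reducer options; the exact streaming jar path depends on the distribution.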

For our Hadoop deployment, we used Cloudera’s distribution. This is important to note because, when working in the realm of open-source or free-license software, one needs to understand where free today may mean expensive tomorrow; Cloudera is a good bet for long-term reliability and availability, so it suits our needs. We used Cloudera Manager to build, manage, monitor, and resize a cluster containing hundreds of nodes, and Cloudera Navigator to handle data management and authorization. In a multi-tenant (i.e., serving many clients) cluster setup, we used the FairScheduler, defining hierarchical queues with pluggable policies to share resources so that all applications get, on average, an equal share of resources over time.

Tools like Sqoop, Flume, Pig, and Morphlines, along with Kafka as the distributed messaging service, were used to build data ingestion pipelines, while Oozie was our workflow orchestration engine for running jobs either on demand or on a predefined schedule. Hive provided SQL-like access to data in HDFS, whereas Solr was our default free-form text search engine for data discovery in the DPP. Apache Spark was our fast, general-purpose engine for data processing, streaming, and machine learning.
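As a flavor of how a few of these pieces fit together, here is a minimal PySpark sketch (assuming Spark 2.x with pyspark available) that reads raw events landed on HDFS by an ingestion pipeline, queries them with Hive-style SQL through a temporary view, and writes a rollup back for reporting. The paths, column names, and application name are hypothetical, not actual DPP jobs.

```python
# Minimal PySpark sketch; the HDFS paths, column names, and app name below
# are hypothetical examples, not our production configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dpp-events-rollup")  # hypothetical job name
         .getOrCreate())

# Batch processing: read raw events landed on HDFS by the ingestion pipeline.
events = spark.read.json("hdfs:///data/raw/events/2016/06/")  # hypothetical path

# SQL-like access: register a temporary view and query it, Hive-style.
events.createOrReplaceTempView("events")
daily_counts = spark.sql("""
    SELECT event_type, COUNT(*) AS cnt
    FROM events
    GROUP BY event_type
    ORDER BY cnt DESC
""")

# Write the rollup back to HDFS for downstream reporting and discovery.
daily_counts.write.mode("overwrite").parquet("hdfs:///data/rollups/event_counts/")

spark.stop()
```

In production, a job like this would typically be packaged, submitted via spark-submit, and scheduled by Oozie either on demand or on a predefined schedule.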

In Part 3 of this blog series, I will describe the high-level data processing architecture, its different components, and how they interact with each other.

 
