Enterprise Data Lake – Data organization on HDFS

Organizing data in HDFS

We built an Enterprise Data Lake on Hadoop. The aim was to provide a common data hub for the entire organization, making it easy for teams and business units to share data (something that was not easily possible earlier). Building an enterprise-level lake involved multiple steps: automating data movement pipelines (Kafka, Sqoop, Flume, etc.), choosing HDFS data storage formats, data transformations (MapReduce, Spark, Oozie, Hue), and full-text search of the stored data for data discovery (SOLR). Although I could talk for hours about each of these steps (writing for hours is much more difficult), in this post I will concentrate primarily on how we organized data storage on HDFS.

 

Hadoop’s schema-on-read approach does not place any restrictions on how data is ingested into HDFS, BUT having some structure, order, and standards around the stored data gave us many benefits:

  • Standard structure made it easier to share data between teams working on the same data sets.
  • Standard HDFS structure promoted reusability of data processing code.
  • Standard HDFS data organization allowed us to enforce multi-user access and quota controls (support for multi-tenancy).
  • We set up organization-wide standards to stage data from external sources (B2B applications) before moving it into the data lake. This prevented partially loaded data from being picked up by processing pipelines. The staging area also acted as a quarantine zone where external data could be vetted for correctness (a minimal sketch of this staging pattern, together with the quota controls, follows this list).
  • We implemented cluster security using Kerberos and encryption (to prevent unauthorized access to sensitive data).
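
For illustration, here is a minimal sketch of the quota controls and the staging-then-promote pattern, driven through the standard hdfs CLI from Python. The paths, group names and the 10 TB quota value are placeholders rather than our actual production settings.

    import subprocess

    def hdfs(*args):
        """Run an hdfs CLI command and fail loudly if it returns non-zero."""
        subprocess.run(["hdfs", *args], check=True)

    # Multi-tenancy: cap the space a department's ETL area may consume.
    hdfs("dfs", "-mkdir", "-p", "/dsc/marketing")
    hdfs("dfsadmin", "-setSpaceQuota", "10t", "/dsc/marketing")

    # Staging pattern: load into a staging directory first, vet the files,
    # then promote them with a rename (atomic on HDFS) so downstream
    # pipelines never see a partially loaded batch.
    staging = "/dsc/marketing/clickstream/input/_staging/2016-05-01"
    final = "/dsc/marketing/clickstream/input/2016-05-01"

    hdfs("dfs", "-mkdir", "-p", staging)
    hdfs("dfs", "-put", "/landing/clickstream/2016-05-01", staging)
    # ... run correctness checks against the staged files here ...
    hdfs("dfs", "-mv", staging, final)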

HDFS – Directory structure and organizing files:

Defining standard practices for how data is organized on HDFS, and enforcing them, made it easier to find, use and share data. Based on our experience and the mistakes we made, I suggest the following HDFS directory structure (at a high level); a small bootstrap sketch follows the list:

  • /user/{individual_user_sub_directory}: place to hold user-specific files (XML, configuration, jar, and sample data files) that are not part of a business process and are primarily used for ad-hoc experiments.
  • /applications: location containing 3rd-party jars, custom uber-jars, Oozie workflow definitions, Pig scripts, etc. that are required to run applications on Hadoop.
  • /dsc: top-level folder containing data in the various stages of the extract, transform and load (ETL) workflow. Under this root folder there are sub-folders for the departments/groups OR applications that own the ETL process. Within each department- or application-specific sub-folder, have a directory for each ETL process or workflow, and within each workflow/process folder, have sub-folders for the following stages: input, output, errors, intermediate-data.
  • /data-sets: folders containing data sets that are shared across the organization. This includes raw data sets as well as data created via transformation and aggregation in the various steps under /dsc. There should be strict controls around which user(s) can read and write this data, with only the ETL process(es) having write access. Since this folder acts as the root folder for shared data sets, there is a sub-folder for each data set.
  • /tmp: short term storage for temporary data generated by tools or users.
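
A minimal bootstrap sketch of this layout, again driving the hdfs CLI from Python. The owners, groups and permission bits below are placeholders, not a prescription.

    import subprocess

    def hdfs(*args):
        subprocess.run(["hdfs", *args], check=True)

    # Top-level layout.
    for path in ("/user", "/applications", "/dsc", "/data-sets", "/tmp"):
        hdfs("dfs", "-mkdir", "-p", path)

    # /tmp is world-writable scratch space; the sticky bit stops users
    # from deleting each other's files.
    hdfs("dfs", "-chmod", "1777", "/tmp")

    # One ETL workflow area for one department (names are hypothetical).
    for stage in ("input", "output", "errors", "intermediate-data"):
        hdfs("dfs", "-mkdir", "-p", "/dsc/marketing/clickstream/" + stage)
    hdfs("dfs", "-chown", "-R", "etl_user:marketing", "/dsc/marketing")

    # Shared data sets: world-readable, writable only by the ETL user.
    hdfs("dfs", "-mkdir", "-p", "/data-sets/clickstream_daily")
    hdfs("dfs", "-chown", "etl_user:analysts", "/data-sets/clickstream_daily")
    hdfs("dfs", "-chmod", "755", "/data-sets/clickstream_daily")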

In addition to the above directory structure, I suggest making good use of techniques like partitioning (which reduces the I/O required during data processing) and bucketing (which breaks large data sets into smaller, more manageable sub-sets) when organizing data; a short example follows.
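
As an illustration only, here is roughly what partitioning and bucketing can look like when writing a shared data set with Spark. The table name, column names and bucket count are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("organize-clickstream")
        .enableHiveSupport()
        .getOrCreate())

    events = spark.read.parquet("/data-sets/clickstream_raw")

    # Partitioning: queries that filter on event_date read only the matching
    # sub-directories. Bucketing: rows are hashed on user_id into 32 files
    # per partition, which cuts shuffle I/O for joins and aggregations on user_id.
    (events.write
        .partitionBy("event_date")
        .bucketBy(32, "user_id")
        .sortBy("user_id")
        .saveAsTable("analytics.clickstream_events"))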

In another post, I will talk about how we built complex pipelines to automate data movement from web logs, server logs, databases, and B2B files into the Enterprise Data Lake using Kafka, Flume, and Sqoop, as well as how we used SOLR to search for data in the lake.

 

2 thoughts on “Enterprise Data Lake – Data organization on HDFS”

  1. Just found this blog. Great posting. Any thoughts on whether you would put /tmp under /user, or does the flattened org structure help with management?


    1. Temporary files generated by applications and tools were placed in the /tmp folder under the HDFS root (/) in our case, but having a tmp folder for each individual user under /user/{user_name}/ would work equally well.

      By the way, to support petabyte-scale data processing at an optimal cost, we ended up moving our Enterprise Data Hub from an on-premise data center deployment to the cloud:
      – Amazon S3 became the default storage layer instead of HDFS
      – Used Amazon EMR to scale the cluster up or down to handle variable loads, instead of building a cluster with fixed capacity
      – Used Apache Spark as the default data processing engine for DI and ML
      – Continued using Apache Spark with Kafka for real-time streaming and for building complex data pipelines (rough sketch below)
      – Used serverless functions
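
      A rough sketch of the Spark-with-Kafka streaming piece (broker, topic, bucket and checkpoint names are all placeholders, and the spark-sql-kafka package must be on the classpath):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col

        spark = SparkSession.builder.appName("clickstream-stream").getOrCreate()

        # Read a Kafka topic as a streaming DataFrame.
        events = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "clickstream")
            .load()
            .select(col("value").cast("string").alias("payload")))

        # Land the stream on S3 as Parquet; the checkpoint keeps the sink
        # consistent across restarts.
        query = (events.writeStream
            .format("parquet")
            .option("path", "s3://example-bucket/data-sets/clickstream_stream/")
            .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/")
            .start())

        query.awaitTermination()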

