Herrick

ETL Frameworks

  • Created: 08 Jul 2020
  • Modified: 14 Aug 2020
  • Status: draft
  • Importance: 5

    When and why do we need ETLs?

    ETL stands for extract, transform, load. It's the procedure of copying data from one or more sources into a destination while applying transformations along the way. For instance, a script with the desired business logic might read from several input sources (a NoSQL database, several JSON files, and a relational database) and then write its output to another relational database. Another example flow would be to take text files in Amazon S3 buckets, aggregate or derive new information from those sources, and upload the transformed text files to other S3 buckets for consumption.
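    As a toy sketch of that first flow (file and table names here are made up for illustration), the three stages might look like:

    ```python
    import json
    import sqlite3

    def extract(path):
        # Extract: read raw records from a JSON file (one of several possible sources).
        with open(path) as f:
            return json.load(f)

    def transform(records):
        # Transform: apply some business logic; here, keep active users
        # and normalize their names.
        return [(r["id"], r["name"].strip().lower()) for r in records if r.get("active")]

    def load(rows, conn):
        # Load: write the transformed rows into a relational database.
        conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", rows)
        conn.commit()
    ```

    A real pipeline would have many such steps chained together, which is exactly where a workflow framework starts to pay off.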

    Before working on massive ETLs with 100+ nodes in a pipeline, I would have found it hard to understand the utility and necessity of these workflow management frameworks. After those experiences, I'd say that as your data complexity increases, you need ways to abstract the process for time and space efficiency. The whole field of data engineering has really picked up as data sizes have grown and storage has become ridiculously cheap.

    For another perspective, Nate Kupp wrote a great Medium post on the various stages of scaling data engineering with respect to your data size.

    Additionally, there is the related paradigm of ELT: extract, load, transform. It's interesting because the idea is to store the raw data first and create transformations later, on top of what has already been loaded. Tools like dbt support this paradigm, letting the analysis side of the data scale more fluidly.

    Mini Qs

    • Why not store everything in one place? Doing so kind of goes against the concept of database normalization.

    • Who should be in charge of the ETL? Should it be the team that owns the E or the L? After working for a little bit, I think this comes down to how complicated your data models are and how teams work cross-functionally. If a data model is extremely complex and not well-documented, it may be difficult for other teams to parse the correct data.

    • What instrumentation would provide a lot of value? Typically, you'll have a scheduler that can run tasks at various frequencies. Having a graphical view of those schedules would help with understanding when things are scheduled and which pipeline graphs get created. Additionally, seeing where data is used and consumed would be helpful for tracking where data goes, kind of like how tagging marine animals helps people figure out migration patterns in the ocean.

    • Why can’t we just use cron and some code in language X? You can. It’s just that as things grow more complex, simple scripts may not cut it to make your engineers happy. You also miss the properties of a good ETL framework that make on-call and data tracking a lot easier.

    What is a good ETL framework?

    There are several properties that make an ETL framework worthwhile.

    • Idempotency: It’s important for an ETL to have the property that running it again with the same inputs produces the same outputs, with no duplicated side effects. This is especially important when pipelines fail partway through and need to be rerun.
    • Replayability: Sometimes, you want to replay a pipeline with some modifications. In that case, you’d want the ability to pause and resume certain parts of a pipeline to make your fixes, so some form of control is important. This includes being able to rerun a pipeline from scratch as well.
    • Observability: You definitely want to be able to know what is happening with your pipeline at any time: timing, what has already run, error codes, stack traces, and so on.
    • Programming Abstractions: You’d want a way to keep complicated pipelines simple and easy to program. Having a strong mental model of what a pipeline is lets you decompose it into something more usable.
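    One way to picture idempotency and replayability together (a minimal sketch, not any particular framework's API): each step declares an output, and the runner skips steps whose output already exists, so rerunning a failed pipeline redoes only the incomplete work.

    ```python
    import os

    def run_step(name, output_path, fn):
        # Skip the step if its output already exists; rerunning the
        # pipeline after a failure redoes only the missing steps.
        if os.path.exists(output_path):
            print(f"{name}: already complete, skipping")
            return
        result = fn()
        with open(output_path, "w") as f:
            f.write(result)
        print(f"{name}: done")
    ```

    Frameworks like Luigi build this "is the output already there?" check into their task model, which is what makes partial reruns cheap.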

    Overview

    At my previous job, we ran Luigi pipelines to handle daily processes. Luigi was chosen because it was the most popular workflow management solution prior to 2015. Using the framework along with a few add-ons, we had better control of scheduling and processing than with something like cron. I’ll go into how Luigi works and dig into Airflow as well.

    Luigi

    Luigi provides two executables: luigid, the visualizer run as a daemon, and luigi itself, the scheduler and runner.

    Luigi allows you to add metadata to a task instance via Parameters. “Parameters are objects set on the Task class level to make it possible to parameterize (adding settings to adjust) tasks.” You can imagine this via

    import luigi

    class SomeTask(luigi.Task):
        # temp is set per task instance, e.g. SomeTask(temp="2020-08-14")
        temp = luigi.Parameter()


    In Luigi, the fundamental unit of work is luigi.Task, which moves data around via Targets, which we can think of as data assets. Tasks are objects that can depend on each other via the requires method and interact with various inputs and outputs (Targets) by stringing them together, like in the docs example.

    What’s nice about Luigi is that requires() declares a task’s dependencies, which makes it easy to depend on another Python class that does something as well. The trade-off is that it can be difficult to track the full pipeline without starting a scheduler. In alternative frameworks, you would explicitly declare your graph, whereas Luigi lets you think about a single task and only its upstream dependencies.
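    A minimal illustration of that style (task and file names here are made up; the Luigi docs have the canonical example): each task declares its direct upstream dependency in requires(), its Target in output(), and its work in run(), and Luigi assembles the graph from those local declarations.

    ```python
    import luigi

    class Extract(luigi.Task):
        def output(self):
            # The Target is the data asset this task produces.
            return luigi.LocalTarget("raw.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("raw data")

    class Transform(luigi.Task):
        def requires(self):
            # Only the direct upstream dependency is declared here;
            # Luigi works out the rest of the graph.
            return Extract()

        def output(self):
            return luigi.LocalTarget("clean.txt")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())
    ```

    Running Transform will first run Extract if raw.txt doesn’t exist yet, and skip any task whose output Target is already present.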

    Airflow

    To be written.

    Other Sources