A Brief Survey and Overview of AWS ETL Tools - Part 4
Introduction
As mentioned in part one of this series, we're taking a look at the various data-centric ETL services offered on AWS. It's not always obvious which service is the most appropriate one for a particular use case or need, and this series of blog posts will hopefully make the use of these services clearer. Part one focused on AWS's Database Migration Service (DMS) and the kinds of scenarios it works well for. Part two focused on AWS Data Pipeline, part three on AWS Glue, and in this article, we will focus on one of the newest AWS services, AWS Managed Workflows for Apache Airflow (MWAA).
Apache Airflow
Apache Airflow is a tool that allows you to build and schedule workflows programmatically. Airflow itself can be run on-prem or within various cloud providers such as Azure, Google Cloud, etc. It's managed and released by the Apache Software Foundation and is not tied to any specific vendor. For organizations where avoiding lock-in to a particular vendor is a priority, it may be a tool well worth looking at. AWS MWAA is Amazon's fully managed service that allows you to host and build cloud-managed Airflow workflows within the AWS cloud and ecosystem.
Another important point to understand is that Airflow - unlike the tools discussed previously - is not a dedicated tool solely for "data"; it's a much more general-purpose tool. Airflow is essentially a task scheduler, albeit a very rich, functional, robust, and scalable one. Airflow is really only concerned with the "what" and "when" of the pipeline/workflow, not the "how" - the work each task actually performs is up to you. It's also worth noting that workflows are authored programmatically using the Python programming language, not visually using a GUI or drag-and-drop tool.
For the purposes of our overview, we're going to take a high-level look at DAGs, tasks, providers, and some of the other out of the box functionality that Airflow provides, and we will also discuss the benefits you get from using AWS's hosted version of Airflow, the MWAA service. We will not be taking a deep look into Airflow internals and architecture.
DAGs and Tasks
The most fundamental concept of Airflow workflows is the DAG. DAG stands for "Directed Acyclic Graph", and it's Airflow's internal representation of the workflow you're authoring. DAGs are defined via Python scripts as part of the workflow authoring process. This script is where you define the tasks you want to execute, their dependencies, and the order they should be executed in. In addition, you can specify which tasks can be run in parallel, have retry logic, or run on an alternate schedule - among many other possibilities. However, one thing that DAGs do not do is handle the work that each task in your workflow will perform. That is the task definition's responsibility - as mentioned prior, DAGs are concerned with the "what" and "when", not the "how". Since this is a very high-level overview, we won't be diving too deep into the details and internals of Airflow. However, DAGs are a fundamental architectural pillar of Airflow, so it is worth learning more about them - you can read the details about them here.
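To make this a bit more concrete, here is a minimal sketch of a DAG definition using Airflow 2.x-style imports. The DAG name, schedule, and task commands below are hypothetical placeholders, not anything specific to MWAA:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG: runs daily and retries failed tasks twice, five minutes apart.
with DAG(
    dag_id="example_daily_pipeline",   # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The DAG only declares ordering and dependencies ("what" and "when");
    # the actual work ("how") lives inside each task.
    extract >> transform >> load
```

Notice that the script itself only declares tasks, scheduling, retry behavior, and ordering - everything else is delegated to the tasks.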
A task is a specific unit of work within your workflow. Tasks are written in Python and are based on "Operators". An Operator is essentially the type of work or command you want your task to execute. For example, if you want to run some arbitrary Python code, your task can use a PythonOperator. If you'd like to send an email, you can use an EmailOperator. Airflow comes out of the box with a nice set of operators ready to go for a variety of task definitions. Also included out of the box are special operators called Sensors, which allow you to set up tasks that poll or wait for certain conditions, such as a file appearing in an S3 bucket or a specific database record or key being inserted. In addition to the out-of-the-box operators, you can also write your own custom operators (or sensors) for your workflow-specific functionality and needs.
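As a quick sketch of what an operator and a sensor look like in practice, the example below pairs an S3KeySensor with a PythonOperator. It assumes Airflow 2.x with the Amazon provider package installed; the bucket and key names are hypothetical, and the sensor's import path varies slightly between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def process_file():
    # Arbitrary Python code to run once the sensor condition is met.
    print("processing new file")


with DAG(
    dag_id="example_sensor_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # Poll S3 until a file appears (bucket and key are hypothetical).
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-landing-bucket",
        bucket_key="incoming/data.csv",
        poke_interval=60,
    )

    # Then run some arbitrary Python code via the PythonOperator.
    process = PythonOperator(task_id="process_file", python_callable=process_file)

    wait_for_file >> process
```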
Providers
As mentioned above, it's possible to write your own Operators for various kinds of task-based needs. It turns out that many third parties have done just this and made their operators available to the public. Integrations for many common systems - databases (MySQL, PostgreSQL, etc.), AWS, Spark, Elasticsearch, Jenkins, Snowflake, Slack, and many other well-known platforms - are available as provider packages that you can install and import into your workflows and tasks. (A much more comprehensive list is available here.) With these providers available, it becomes possible to create feature-rich, sophisticated workflows with very complex scheduling, retry, and parallelization logic across a variety of modern data and server platforms. This combination of workflows, tasks, and providers becomes an ideal way to write sophisticated ETL logic and data pipelines - which are essentially just another kind of workflow.
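For illustration, here is a hedged sketch of what using provider packages might look like, assuming the Amazon and Postgres providers are installed. The connection IDs, bucket, and table names are hypothetical:

```python
# Provider packages install alongside Airflow, e.g.:
#   pip install apache-airflow-providers-amazon apache-airflow-providers-postgres
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="example_provider_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Copy a file from S3 into Redshift using the Amazon provider
    # (connection IDs, bucket, and table names are hypothetical).
    load_to_redshift = S3ToRedshiftOperator(
        task_id="load_to_redshift",
        s3_bucket="my-data-bucket",
        s3_key="exports/orders.csv",
        schema="public",
        table="orders",
        copy_options=["CSV"],
        redshift_conn_id="redshift_default",
    )

    # Run follow-up SQL against a Postgres database via the Postgres provider.
    refresh_reporting = PostgresOperator(
        task_id="refresh_reporting",
        postgres_conn_id="postgres_default",
        sql="REFRESH MATERIALIZED VIEW reporting.daily_orders;",
    )

    load_to_redshift >> refresh_reporting
```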
UI and Visualization tools
To help monitor and troubleshoot your workflows and pipelines, Airflow comes out of the box with some nice visualization tools. Airflow's base UI defaults to a screen that shows all of your DAGs (workflows), their last run time, their schedule, and other details. From there you can drill into other screens that display and visualize additional information. There is a tree view that shows all the tasks for a specific pipeline and whether they've run or not, which can be helpful for troubleshooting. There is also a graph view that shows the current state of a running pipeline and visualizes task dependencies, durations, and other relevant information. A list of screens along with various screenshots can be seen here.
AWS Managed Workflows for Apache Airflow (MWAA)
With some Airflow basics out of the way, we can move on to AWS's cloud-native service for it. AWS provides quite a few compelling features and integrations on top of native Airflow that turn it into a first-class cloud-native workflow solution. Let's take a look at some of the features that AWS brings to its Airflow service:
- Automated provisioning: AWS will automatically provision new workers in your Airflow environment as needed. This allows for autoscaling of workflow resources and no need to manually provision or configure hardware. For a distributed system like Airflow this is a huge operational win and significantly lowers the barrier to entry and cost of setting up and running it.
- Monitoring/Logging: AWS tightly integrates Airflow's own feature-rich logging and monitoring with AWS CloudWatch. This provides one central place for capturing and observing all your workflows and their associated metrics and log information. It also allows you to leverage all the alerting and event-based functionality that CloudWatch offers. This integration really improves and strengthens Airflow's already impressive logging and monitoring capabilities.
- Containerization: If you are using container-based solutions to host Airflow, or are interested in hosting it via containerization, MWAA integrates completely with AWS Fargate, AWS's serverless container engine. It allows hosting a fully containerized instance of Airflow, and will also scale on demand by creating more containers as needed for your workflows.
- AWS Integration: As with many other AWS services, Amazon has provided first-class integration for Airflow with many of its other services. Athena, Batch, DynamoDB, EMR, Lambda, Glue, Redshift, SQS, Kinesis, SageMaker, and S3 are just some of the services AWS provides out-of-the-box Airflow providers for, making it easy to integrate all the functionality of these services into your pipelines (a short example follows below).
- Security: As with any AWS service, Airflow is completely integrated with AWS's IAM authentication and authorization services, and it is also integrated with AWS's VPC infrastructure.
Aside from the above, MWAA provides many other AWS integrations on top of Airflow to enhance its functionality and make it a true cloud-native service. However, it's important to keep in mind that the workflows themselves are still just Python scripts - not some AWS proprietary format - so your workflows remain highly portable and, for the most part, vendor-neutral, which is great for those who do not want to be locked into a single cloud provider or environment.
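As a sketch of what that AWS integration can look like inside a DAG, the example below chains a Glue job and an Athena query using operators from the Amazon provider package that MWAA environments typically have available. The job, database, and bucket names are hypothetical, and exact operator class names vary a bit between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="example_mwaa_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off an existing Glue job (job name is hypothetical).
    run_glue_job = GlueJobOperator(
        task_id="run_glue_job",
        job_name="transform-orders",
    )

    # Then query the curated output with Athena
    # (database, table, and results bucket are hypothetical).
    summarize = AthenaOperator(
        task_id="summarize",
        query="SELECT COUNT(*) FROM orders_curated",
        database="analytics",
        output_location="s3://my-athena-results/",
    )

    run_glue_job >> summarize
```

The same DAG will run on any Airflow installation with the Amazon provider installed, which is part of why MWAA workflows stay portable.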
Airflow and Glue
Another tool in the AWS ecosystem that allows for the creation of workflows and ETL pipelines is AWS Glue, so an obvious question is which is the more appropriate tool to use. Despite the fact that they have some overlapping functionality, they are quite different. Airflow is a general-purpose workflow authoring tool - it just so happens that it also really excels at creating data-focused workflows and data pipelines. However, it's much more capable than just working with data and can be used for far more general purposes. Airflow has a very large set of providers available that allow it to integrate with a far wider range of systems and platforms that aren't necessarily data-centric or used for ETL purposes. If you have workflows that might incorporate API calls, shell scripts, programmatic automation, or other general needs, then Airflow would definitely be the tool to use. Also, as mentioned several times previously, there is nothing about Airflow that ties it to AWS, and Airflow scripts are completely cloud-vendor neutral.
Unlike Glue, Airflow doesn't have functionality for data governance or cataloging purposes, and it has no built-in functionality for creating or maintaining AWS S3-based data lakes. If your workflows and data engineering efforts are based around AWS data lakes and you're looking to leverage tools like AWS Lake Formation, then Glue would be a better fit. Likewise, if you're looking to reduce overhead with data governance efforts and need access to a rich metadata-based catalog for your data sets, Glue would be a better fit as well. Another important difference is that Glue uses Spark under the hood, so if you already have a large investment in Spark for your current ETL purposes, Glue might also be a better fit, since you would have far less code to rewrite or re-engineer.
Official Apache Airflow Homepage
Official Apache Airflow Documentation
Official AWS MWAA Homepage
Official AWS MWAA FAQs
Official AWS MWAA Documentation