Streamlining Data Workflows: The Essential Guide to Apache Airflow

Imagine you're juggling multiple tasks—downloading data from APIs, cleaning it up, running complex transformations, and finally pushing it into dashboards or machine learning models. Doing this manually can feel like herding cats, where every little misstep can cause chaos. Now, multiply this complexity by ten if you’re managing workflows for an entire organization. Overwhelmed? This is exactly where Apache Airflow swoops in like a superhero.

Apache Airflow is a workflow orchestration platform that simplifies the automation of these tasks. It not only keeps your workflows organized but ensures they run smoothly, saving you from unnecessary headaches.

Let’s dive into what makes Apache Airflow indispensable for modern data engineering and how you can leverage it.


Airflow Architecture: Behind the Scenes

To really get the hang of Apache Airflow, you need to understand how its architecture works. Think of it like an orchestra, where every instrument has a role, and the conductor (that’s Airflow) makes sure everything plays in tune. Without the right setup, things would sound... well, pretty off-key.

Key Components

airflow.cfg

This is Airflow’s main configuration file—basically the rulebook. It tells Airflow how to run the show. From database settings to executor options, it’s all in here.

Example: Want to swap your database from SQLite to PostgreSQL? Just update the connection string in airflow.cfg and voilà—Airflow’s on a whole new beat.
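For instance, a minimal sketch of that switch (the credentials and database name are placeholders; recent Airflow 2.x releases keep this key in the [database] section, while older releases use [core]):

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db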

Webserver

The webserver is Airflow’s stage, where all the magic happens. It’s the dashboard you use to watch your workflows perform.

Example: Want to integrate LDAP authentication into the UI? Set up a few configuration options for the webserver and you're good to go!
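For a rough idea of what that looks like: in Airflow 2.x the UI is built on Flask AppBuilder, so LDAP is typically wired up in webserver_config.py rather than airflow.cfg. A hedged sketch (the server address, search base, and default role are placeholders, not a drop-in setup):

# webserver_config.py -- hypothetical LDAP settings via Flask AppBuilder
from flask_appbuilder.security.manager import AUTH_LDAP

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://ldap.example.com:389"  # placeholder LDAP server
AUTH_LDAP_SEARCH = "ou=users,dc=example,dc=com"   # placeholder search base
AUTH_USER_REGISTRATION = True                     # create users on first successful login
AUTH_USER_REGISTRATION_ROLE = "Viewer"            # default role for auto-registered users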

Scheduler

This is the brain behind the curtain, the mastermind who decides which tasks to play and when. It’s constantly checking task dependencies and execution status, making sure everything is in sync.

dags folder

The dags folder is where your DAG definitions live. It acts as the central hub for all your workflows, so every DAG you write should be placed in this location.
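Its location is configurable in airflow.cfg (the path below is just a placeholder):

[core]
dags_folder = /home/airflow/dags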

metadata database

The metadata database is the memory of Airflow. It stores all the critical data—task states, logs, execution history—you name it.

How It All Fits Together

  • The scheduler picks tasks from the DAGs and hands them off to the executors.

  • The webserver shows you the live performance, so you can monitor how your workflows are doing in real time.

  • The metadata database keeps everything organized and in sync, making sure Airflow remembers every detail of the performance.

This modular design is what makes Airflow both powerful and flexible.


Setting Up Apache Airflow

Let’s get you cooking with Airflow! Setting it up is easier than you might think, even if you’re starting from scratch. Here’s a step-by-step guide.

Step 1: Install Python

If Python isn’t installed on your system yet, don’t worry! Just follow the official installation guide for your operating system.

Once Python is installed and verified with python --version, create and activate a virtual environment before proceeding:

python -m venv airflow-env  
source airflow-env/bin/activate  # Linux/Mac  
airflow-env\Scripts\activate     # Windows

Step 2: Install Apache Airflow

With the virtual environment up and running, it’s time to install Airflow. Run these commands:

export AIRFLOW_HOME={airflow_config_path}
AIRFLOW_VERSION={version}
PYTHON_VERSION={version}
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
pip install pymysql   # for MySQL as the metadata database
pip install psycopg2  # for PostgreSQL as the metadata database
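If you’d rather not fill in the Python version by hand, you can derive it from the active interpreter before building the constraint URL (assuming a POSIX shell):

PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"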

Pro Tip: Use a dedicated Python virtual environment to avoid dependency issues.

Step 3: Set Up and Initialize the Metadata Database

Airflow requires a metadata database to track task states, DAG runs, and logs. If you don’t already have a MySQL or PostgreSQL instance set up, follow its official installation guide to get started.

Configure your database connection in the airflow.cfg file:

sql_alchemy_conn = mysql+pymysql://<username>:<password>@<host>:<port>/<database_name>  
# For PostgreSQL:  
# sql_alchemy_conn = postgresql+psycopg2://<username>:<password>@<host>:<port>/<database_name>
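If you’re starting from a fresh PostgreSQL server, Airflow also needs a database and a user to connect as. A minimal PostgreSQL sketch (the database name, user, and password below are placeholders):

CREATE DATABASE airflow_db;
CREATE USER airflow_user WITH PASSWORD 'airflow_pass';
GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;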

Then initialize the database with the following command (we’ll get to why this is required in a moment):

airflow db init

Step 4: Create an Admin User

To access the Airflow UI, you’ll need an admin user. Create one with the following command:

airflow users create \
    --username admin \
    --password admin \
    --firstname Firstname \
    --lastname Lastname \
    --role Admin \
    --email admin@example.com

Step 5: Start the Airflow Web Server

Run the following command to start the web server:

airflow webserver --port 8080

The Airflow UI will be available at http://localhost:8080.

Step 6: Start the Scheduler

Airflow relies on a scheduler to execute tasks. Start it using:

airflow scheduler

Step 7: Verify the Setup

Navigate to the Airflow UI and confirm everything is running correctly. You’re now ready to create and run workflows, like the example below!
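Here’s a minimal sketch of a first DAG, assuming Airflow 2.x and that the file is saved in your dags folder; the dag_id, task names, and functions are just placeholders:

# example_etl.py -- a minimal two-task DAG (hypothetical example)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data...")


def transform():
    print("transforming data...")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older 2.x versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # run extract before transform

Once the file is in the dags folder, the DAG should show up in the UI shortly, where you can trigger it manually or let the schedule pick it up.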


But Why Is Database Configuration Required?

Picture this: without a reliable database, Airflow would be like trying to organize a massive group project where no one remembers what tasks are done, who's working on what, or when the deadlines are. Total chaos, right? That’s why the metadata database is Airflow’s secret weapon—it keeps everything organized and running smoothly.

Why It Matters

  • Tracks Workflow States: The database keeps a record of DAG runs, task statuses, and logs, ensuring you know what’s completed and what’s pending.

  • Handles Scheduling: It stores task dependencies, schedules, and run history that the scheduler relies on to decide what runs next.

  • Enables Monitoring: Provides real-time updates visible on the UI, so you’re always in the loop.

Common Database Options

  • SQLite: Perfect for local testing and quick experiments.

  • MySQL/PostgreSQL: Ideal for production environments, offering better performance and scalability.


Scaling Apache Airflow: From Local to Distributed

So, you’ve set up Airflow and it's running like a charm. But as your workflows grow and the number of tasks increases, you’ll need Airflow to scale. Fortunately, Airflow has you covered with different executors that let you choose the right fit based on your workload.

Executors: Pick the Right One for the Job

  1. Sequential Executor
    The Sequential Executor is the simplest of the bunch. It handles one task at a time, which makes it great for testing or running tiny workflows. But if you’re trying to process multiple tasks, it’s like delivering packages one by one on foot—reliable, but super slow.

    Use It For: Local testing or workflows with minimal complexity.

  2. Local Executor
    Need to level up? The Local Executor lets you process multiple tasks in parallel on a single machine. Think of it as upgrading from walking to driving a small car—it’s faster, more efficient, and can handle a bit more traffic.

    Use It For: Medium-sized workflows where running tasks on one machine is sufficient.

  3. Celery Executor
    Now we’re in the big leagues. The Celery Executor is built for distributing tasks across multiple machines (workers), so you can scale your workflows to handle a ton of tasks at once. It’s like running a well-coordinated delivery fleet instead of relying on a single car—quick, efficient, and capable of handling high demand.

    Use It For: Large-scale workflows that require heavy-duty parallel processing and scalability.
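Whichever executor you pick, the choice lives in airflow.cfg. For example, switching to the Local Executor is a one-line change (a sketch; everything else stays the same):

[core]
executor = LocalExecutor   # or SequentialExecutor / CeleryExecutor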

Message Broker: The Secret Sauce for Distributed Tasks

When you use the Celery Executor, you need a message broker to manage task distribution, and that’s where brokers like Redis, RabbitMQ, and SQS come into play. A message broker acts as the communication hub, making sure tasks are assigned to the right worker without missing a beat.

Why Are Message Brokers Necessary?

  • Scalability: Effortlessly scale by adding more workers to handle an increasing number of tasks.

  • Fault Tolerance: If a worker fails, the broker ensures tasks aren’t lost—they're rerouted to another worker.

  • Performance: In-memory brokers like Redis handle high-volume task distribution with minimal delay.

For the purpose of this article, I will use Redis as the message broker. However, be sure to choose a broker that best fits your specific requirements.

How to Scale with Celery Executor and Redis

  1. Install Redis

Redis acts as the message broker for Celery. It queues tasks and ensures they’re executed by available workers. To install Redis, follow the official installation guide for your platform.

After installation, verify it with redis-server --version (and confirm the server is responding with redis-cli ping).

  2. Install the Redis and Celery extras for Airflow

     # activate the existing virtual environment, then install the extras
     pip install "apache-airflow[redis,celery]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
    
  3. Update Airflow Configuration
    In airflow.cfg, configure Airflow to use the Celery Executor and point it to your Redis instance:

     [core]
     executor = CeleryExecutor

     [celery]
     broker_url = redis://localhost:6379/0
     result_backend = redis://localhost:6379/0
    
  4. Start Workers
    Launch your workers, and they’ll begin pulling tasks from Redis as they become available.

     airflow celery worker
    

Now your setup can scale horizontally: the scheduler pushes task messages to Redis, Celery workers on any machine pick them up, and results and task states flow back into the metadata database.
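If you want to confirm which executor your installation is actually running, you can query the live configuration from the CLI (available in Airflow 2.x):

airflow config get-value core executor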


Pros and Cons of Apache Airflow

Pros

  • Mature Ecosystem: Many plugins and integrations available.

  • Python-based: Airflow uses Python, making it accessible to a wide range of developers.

  • Flexibility: Everything is code, offering complete control over workflows and task logic.

  • Scalability: Airflow can scale for large workflows and multiple users.

  • Strong Community: A large ecosystem and community support contribute to its widespread use.

Cons

  • Local Development Challenges: Harder to replicate environments.

  • Difficult Debugging: Unstructured logs and UI make troubleshooting harder.

  • Limited Data Lineage: Poor visibility into data dependencies.

  • Limited CI/CD Support: Difficult to implement automated testing.

Choose the orchestration tool based on your specific requirements, infrastructure considerations, and cost.

Alternatives to Apache Airflow

While Airflow is incredibly versatile, it’s not the only orchestration tool out there. Here are some alternatives:

  • Prefect: Easier to use, with task-level retries and better fault tolerance.

  • Luigi: Lightweight and great for simple workflows.

  • Dagster: Focuses on data quality and lineage tracking.

  • Google Cloud Workflows: Ideal for cloud-native workflows.


Wrapping It All Up

Apache Airflow is like having a dependable co-pilot for your workflows. From orchestrating complex pipelines to automating mundane tasks, it’s a game-changer for data engineering.

And you know the best part? Airflow has excellent documentation that covers everything in detail. Whether you’re troubleshooting or exploring advanced features, you’ll find the answers in the official docs.

Whether you’re a beginner or an experienced engineer, mastering Apache Airflow will elevate your data workflows and unlock new possibilities. Ready to take the leap?