Local Dev with Astro CLI and Remote Databases

When working with Astronomer and Airflow for data processing and ETL tasks, it’s common to encounter scenarios where you need to interact with databases hosted within a private subnet in your cloud provider. In this blog post, we’ll demonstrate how to set up local development for Astronomer and Airflow while connecting to a Redshift database hosted in a private AWS subnet.

TLDR: We’ll achieve this by adding a new container to the Docker Compose override file that the Astro CLI picks up, and configuring the Airflow Connection to reach your remote DB through that container.

Prerequisites

Before diving in, make sure you have the following installed and configured:

  1. Astro CLI (instructions here)
  2. Docker (instructions here)
  3. A Redshift cluster (or any other database) hosted in a private subnet
  4. A bastion EC2 instance in the same subnet as the DB that you are trying to connect to.
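
If you don’t yet have a local Airflow project, the Astro CLI can scaffold and run one for you. A typical sequence (standard Astro CLI commands; your project layout may differ):

# Scaffold a new Astro/Airflow project in the current directory
astro dev init

# Build the local image and start Airflow (webserver, scheduler, etc.)
astro dev start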

Step 1: Create a Docker Compose Override File

  1. In your local Airflow project directory, create a new file named docker-compose.override.yml.
  2. Add the following configuration to the docker-compose.override.yml file:
version: '3'
services:
  ssh-tunnel:
    build: ./ssh-tunnel  # or point `image:` at a published copy, e.g. in ECR
    depends_on:
      - scheduler
    environment:
      - SSH_USER=<YOUR_SSH_USER>
      - SSH_HOST=<YOUR_BASTION_HOST_PUBLIC_IP>
      - SSH_PRIVATE_KEY_FILE=/ssh/id_rsa
      - SSH_PRIVATE_KEY=<YOUR_SSH_PRIVATE_KEY>
      - TUNNELS=5439:<YOUR_REDSHIFT_ENDPOINT>:5439
    volumes:
      - ./ssh/id_rsa:/ssh/id_rsa:ro
    networks:
      - airflow

Step 2: Define the Docker image used to open the connection

For this, create a new directory named ssh-tunnel in your Airflow project and place a Dockerfile inside it. Alternatively, you can publish the image to a Docker registry such as ECR and reference it from the Compose file.

...
.astro
.github
dags
include
ssh-tunnel/
   Dockerfile
...

Add the following Dockerfile:

# Base image
FROM ubuntu:20.04 AS base
LABEL maintainer="[email protected]"

# Set non-interactive installation mode
ENV DEBIAN_FRONTEND=noninteractive

# Update and install necessary packages
RUN apt-get update && apt-get install -y \
    curl \
    openssh-client \
    unzip && \
    apt-get upgrade -y && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /home

# Install AWS CLI V2
RUN curl -s https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o /tmp/awscliv2.zip && \
    unzip /tmp/awscliv2.zip -d /tmp && \
    /tmp/aws/install && \
    rm -rf /tmp/aws /tmp/awscliv2.zip

# Install AWS SSM plugin
RUN curl -s https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb -o /tmp/session-manager-plugin.deb && \
    dpkg -i /tmp/session-manager-plugin.deb && \
    rm -f /tmp/session-manager-plugin.deb

# Expose necessary ports
EXPOSE 22 5439

# Set default command
CMD ["bash"]

In this example we install the AWS CLI and the AWS SSM Session Manager plugin so the container can also reach the bastion through SSM. Adapt the image as needed; another option is to start from the official amazon/aws-cli Docker image.
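
The Compose service above passes SSH_USER, SSH_HOST, SSH_PRIVATE_KEY_FILE and TUNNELS into the container, so the image needs something that turns those variables into an SSH port forward. Below is a minimal entrypoint sketch, assuming exactly those variables and the LOCAL_PORT:REMOTE_HOST:REMOTE_PORT format used above; adapt it to whatever image you actually use:

#!/usr/bin/env bash
# entrypoint.sh (hypothetical): build the SSH port forward from env vars.
set -euo pipefail

# Bind the forwarded port on 0.0.0.0 so other Compose services
# (scheduler, webserver) can reach it via the "ssh-tunnel" hostname.
exec ssh -i "${SSH_PRIVATE_KEY_FILE}" \
    -o StrictHostKeyChecking=no \
    -o ServerAliveInterval=30 \
    -N -4 \
    -L "0.0.0.0:${TUNNELS}" \
    "${SSH_USER}@${SSH_HOST}"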

Step 3: Open the tunnel

To open the tunnel, the ssh-tunnel container needs to execute an SSH command along these lines (bind the local side to 0.0.0.0, as in the sketch above, if other containers must reach the forwarded port):

ssh -i /path/to/my_id_rsa -L 5439:my-redshift-cluster.abcdefg.us-west-2.redshift.amazonaws.com:5439 -N -4 [email protected]
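
Once the service is defined, restart the local environment and confirm that the forwarded port is reachable from Airflow’s containers. A quick check, assuming your Astro CLI version supports astro dev bash:

# Restart so Docker Compose picks up the new ssh-tunnel service
astro dev restart

# Open a shell in the scheduler container, then probe the forwarded port
astro dev bash
python3 -c "import socket; socket.create_connection(('ssh-tunnel', 5439), timeout=5); print('tunnel OK')"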

Step 4: Create an Airflow Connection

To use the SSH tunnel container to access the Redshift cluster or other databases in the private AWS subnet, create an Airflow Connection with the correct settings (an equivalent environment-variable form is shown after these steps).

  1. Open the Airflow UI by visiting localhost:8080 in your web browser.
  2. Click on the “Admin” menu in the top navigation bar and then click “Connections.”
  3. Click the “Create” button to create a new connection.
  4. Fill in the following fields:
    • Conn Id: Enter a unique identifier for this connection (e.g., redshift_ssh_tunnel).
    • Conn Type: Select the appropriate connection type for your database (e.g., Postgres for Redshift).
    • Host: Enter ssh-tunnel (this is the name of the service defined in the docker-compose.override.yml file).
    • Schema: Enter your Redshift schema name.
    • Login: Enter your Redshift username.
    • Password: Enter your Redshift password.
    • Port: Enter the port number used by your Redshift cluster (e.g., 5439).
    • Extra: Add any additional connection parameters required for your use case (e.g., {"sslmode": "require"} for SSL connections).
  5. Click “Save” to create the connection.
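
Alternatively, the same connection can be declared as an environment variable in the project’s .env file, which the Astro CLI loads into the Airflow containers; Airflow resolves any AIRFLOW_CONN_<CONN_ID> variable given in URI form. A sketch with placeholder values:

AIRFLOW_CONN_REDSHIFT_SSH_TUNNEL=postgres://<YOUR_REDSHIFT_USER>:<YOUR_REDSHIFT_PASSWORD>@ssh-tunnel:5439/<YOUR_SCHEMA>?sslmode=require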

Now, when you create a new DAG, you can reference this connection by its unique identifier (e.g., redshift_ssh_tunnel). Here’s an example of how to use the connection in a DAG:

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2023, 3, 16),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "redshift_example",
    default_args=default_args,
    description="An example DAG to interact with Redshift via SSH tunnel",
    schedule_interval=timedelta(days=1),
    catchup=False,
    max_active_runs=1,
)

query_task = PostgresOperator(
    task_id="query_redshift",
    postgres_conn_id="redshift_ssh_tunnel",
    sql="SELECT COUNT(*) FROM my_table;",
    dag=dag,
)

query_task

In this example, the PostgresOperator uses the redshift_ssh_tunnel connection to run a SQL query on the Redshift cluster. The ssh-tunnel hostname ensures that the connection is made through the SSH tunnel container.
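
To exercise the DAG locally before letting it run on a schedule, the Astro CLI can proxy Airflow CLI commands into the scheduler container. A single test run of the example DAG, assuming a recent Astro CLI and Airflow 2.x:

astro dev run dags test redshift_example 2023-03-16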

Conclusion

In this blog post, we demonstrated how to develop locally with Astronomer and Airflow while connecting to a Redshift database or any other database hosted in a private subnet. By adding a tunnel container in your docker-compose.override.yml file and configuring your Airflow Connection to go through that tunnel, you can maintain a secure connection to your database resources without compromising the security of your AWS infrastructure.