Build a Simple Data Pipeline with Python and Docker

This example demonstrates how to build a basic, modular data pipeline with Python scripts, running each step in its own Docker container. The pipeline consists of three steps:

Extract: Download data from a source.

Transform: Process/clean the data.

Load: Save the output to a database or a file.

Project Structure

data-pipeline/
│
├── docker-compose.yml
├── requirements.txt
├── extract/
│   ├── Dockerfile
│   └── extract.py
├── transform/
│   ├── Dockerfile
│   └── transform.py
├── load/
│   ├── Dockerfile
│   └── load.py
└── data/             # Shared directory for passing files between steps


extract/extract.py

import json
import requests

def fetch_data():
    url = 'https://jsonplaceholder.typicode.com/posts'
    response = requests.get(url)
    response.raise_for_status()
    with open('/data/raw_data.json', 'w') as f:
        json.dump(response.json(), f)
    print("Extracted data saved to /data/raw_data.json")

if __name__ == '__main__':
    fetch_data()
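
The request above has no timeout, so a stalled connection can hang the step indefinitely. A minimal hardening sketch, keeping the same endpoint and output path (the timeout, retry count, and backoff values are illustrative, not part of the original script):

import json
import time

import requests

def fetch_data(url='https://jsonplaceholder.typicode.com/posts', retries=3):
    for attempt in range(1, retries + 1):
        try:
            # Fail fast instead of hanging indefinitely on a dead connection
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            break
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(2 * attempt)  # simple linear backoff between retries
    with open('/data/raw_data.json', 'w') as f:
        json.dump(response.json(), f)
    print("Extracted data saved to /data/raw_data.json")

if __name__ == '__main__':
    fetch_data()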

transform/transform.py

import json

def transform():
    with open('/data/raw_data.json', 'r') as infile:
        records = json.load(infile)
    # Keep only the id and title fields from each record
    transformed = [{'id': r['id'], 'title': r['title']} for r in records]
    with open('/data/clean_data.json', 'w') as outfile:
        json.dump(transformed, outfile)
    print("Transformed data saved to /data/clean_data.json")

if __name__ == '__main__':
    transform()
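
If some records might be missing fields, a slightly more defensive version of the same transformation could look like this (the filtering rule is just an illustration, not part of the original step):

import json

def transform():
    with open('/data/raw_data.json', 'r') as infile:
        records = json.load(infile)
    # Skip records missing an id or title instead of raising KeyError
    transformed = [
        {'id': r['id'], 'title': r['title'].strip()}
        for r in records
        if 'id' in r and r.get('title')
    ]
    with open('/data/clean_data.json', 'w') as outfile:
        json.dump(transformed, outfile)
    print("Transformed data saved to /data/clean_data.json")

if __name__ == '__main__':
    transform()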

load/load.py

import json
import csv

def load():
    with open('/data/clean_data.json', 'r') as infile:
        records = json.load(infile)
    with open('/data/result.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['id', 'title'])
        writer.writeheader()
        writer.writerows(records)
    print("Loaded data into /data/result.csv")

if __name__ == '__main__':
    load()
1. Docker Setup

requirements.txt

requests

Dockerfiles
All three steps use a nearly identical Dockerfile (create one in each subdirectory); only the directory copied in and the script named in CMD change. Because the build context in docker-compose.yml below is the project root, the shared requirements.txt can be copied directly (a path like ../../requirements.txt would not work, since Docker cannot copy files from outside the build context).

extract/Dockerfile (for the other steps, replace extract with transform or load):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY extract/ .
CMD ["python", "extract.py"]
2. docker-compose.yml
version: '3.9'
services:
  extract:
    build:
      context: .
      dockerfile: extract/Dockerfile
    volumes:
      - ./data:/data

  transform:
    build:
      context: .
      dockerfile: transform/Dockerfile
    volumes:
      - ./data:/data
    depends_on:
      - extract

  load:
    build:
      context: .
      dockerfile: load/Dockerfile
    volumes:
      - ./data:/data
    depends_on:
      - transform
3. How to Run the Pipeline

Build all containers:
docker-compose build

Run the pipeline services one at a time, in order (depends_on only controls start order, not completion, so each step must finish before you start the next):

docker-compose run extract
docker-compose run transform
docker-compose run load

Or, to chain all three steps automatically, use a small wrapper script, as sketched below.
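
For example, a small Python runner (a hypothetical run_pipeline.py at the project root, assuming docker-compose is available on your PATH) can chain the three one-off containers and stop as soon as a step fails:

import subprocess

# Run each one-off service in order; --rm removes the container afterwards,
# and check=True raises CalledProcessError (stopping the pipeline) if a step fails.
for step in ['extract', 'transform', 'load']:
    subprocess.run(['docker-compose', 'run', '--rm', step], check=True)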

4. Result

After running all the steps, you’ll have:

data/raw_data.json (raw API data)

data/clean_data.json (transformed)

data/result.csv (CSV output ready for use)

5. Customization Ideas

Swap the extract step to read from a database or S3.

Change the load step to upload to a database instead of writing a CSV (see the SQLite sketch after this list).

Add logging, error handling, or orchestration (e.g., Airflow) for advanced pipelines.
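
As a concrete example of the database idea above, here is a minimal sketch of a load step that writes to SQLite instead of CSV (the table name and database path are illustrative):

import json
import sqlite3

def load():
    with open('/data/clean_data.json', 'r') as infile:
        records = json.load(infile)
    conn = sqlite3.connect('/data/result.db')
    conn.execute('CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, title TEXT)')
    # Named placeholders pull the values straight out of each record dict
    conn.executemany(
        'INSERT OR REPLACE INTO posts (id, title) VALUES (:id, :title)',
        records,
    )
    conn.commit()
    conn.close()
    print("Loaded data into /data/result.db")

if __name__ == '__main__':
    load()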

Running the Pipeline Without Docker

You can run the ETL steps individually from the project root. The scripts write to /data/..., so without Docker either create that directory locally or point the paths at a local data/ folder first:

python extract/extract.py
python transform/transform.py
python load/load.py
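
Or chain the three steps with a small local runner (a hypothetical run_local.py at the project root; it assumes the /data/... paths inside the scripts have been adjusted for your machine, as noted above):

import subprocess
import sys

# Execute each step with the current Python interpreter, stopping on the first failure
for script in ['extract/extract.py', 'transform/transform.py', 'load/load.py']:
    subprocess.run([sys.executable, script], check=True)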
