Skip to main content

Command Palette

Search for a command to run...

A Beginner's Guide to Data Pipelines for Developers

Published
12 min read
A Beginner's Guide to Data Pipelines for Developers
M
Programmer, Creative and Tech Nerd. I build with code, write about AI and Software Development and explore the future of intelligent Agents.

Published on NeuralStack | MS · Software Engineering · Data Engineering


Introduction

Every modern application generates data. User events, logs, transactions, sensor readings, and API responses: the volume is constant and the pace is relentless. The challenge is not collecting that data; it is moving it reliably from where it originates to where it needs to be, in the right shape, at the right time.

That is precisely what a data pipeline does.

If you are a developer who has written application code but never formally worked in data engineering, this guide is for you. I will cover what data pipelines are, how they are structured, the vocabulary you need to navigate the field, and the architectural decisions you will encounter early on.


1. What Is a Data Pipeline?

A data pipeline is an automated sequence of steps that ingests data from one or more sources, transforms it, and delivers it to a destination, typically a database, a data warehouse, a message queue, or another service.

Think of it as an assembly line. Raw materials (raw data) enter at one end, pass through a series of stations (transformations), and exit as a finished product (clean, structured, usable data).

A minimal pipeline has three stages:

[Source] → [Transform] → [Destination]

In practice, pipelines can be far more complex: multiple sources, fan-out delivery to multiple destinations, conditional branching, error-handling paths, schema validation, deduplication, and enrichment steps, all stitched together and orchestrated on a schedule or in real time.


2. Core Concepts and Vocabulary

Before going further, it is worth establishing a shared vocabulary. Data engineering has its own lexicon, and misusing these terms creates confusion quickly.

2.1 ETL and ELT

ETL (Extract, Transform, Load) is the classic pipeline pattern:

  1. Extract – pull data from the source system.

  2. Transform – clean, reshape, and enrich it.

  3. Load – write the result to the destination.

ELT (Extract, Load, Transform) reverses the last two steps. Raw data is loaded into the destination first, then transformed in place, a pattern made practical by the rise of cloud data warehouses like BigQuery, Snowflake, and Redshift, which can handle heavy computation at scale.

The choice between ETL and ELT is an architectural one. ETL keeps raw data out of the warehouse and enforces data quality upstream. ELT gives analysts and data scientists access to raw data faster, at the cost of potentially messier intermediate states.

2.2 Batch vs. Streaming

Batch pipelines process data in chunks at scheduled intervals – every hour, every night, every Monday morning. They are simpler to build, easier to debug, and well-suited to use cases where latency is not critical (nightly reporting, weekly aggregations, ML training jobs).

Streaming pipelines process data continuously, event by event, with latency measured in milliseconds to seconds. They are essential for use cases that require real-time responses: fraud detection, live dashboards, recommendation systems, and alerting infrastructure.

The choice is driven by the latency requirement of the downstream consumer, not by the volume of data.

2.3 Sources and Sinks

  • Source: Where data originates – a relational database, an external API, a file system, an event stream, a SaaS platform, a IoT sensor.

  • Sink: Where processed data is delivered – a data warehouse, a NoSQL store, a search index, a message queue, another service.

A single pipeline can have multiple sources and multiple sinks.

2.4 Idempotency

A pipeline step is idempotent if running it multiple times with the same input produces the same result without side effects. This is a critical property. Networks fail. Processes crash. Pipelines retry. If your transformations or load steps are not idempotent, retries will corrupt your data.

2.5 Backfilling

Backfilling is the process of re-running a pipeline over historical data, typically after fixing a bug, adding a new column, or changing business logic. Good pipeline design accounts for backfilling from the start, because the need for it is almost inevitable.


3. Anatomy of a Pipeline

Let us break down the three canonical stages in more detail.

3.1 Extraction

Extraction means reading data from a source. The technical approach depends on the source type:

Source Type Common Extraction Method
Relational DB (Postgres, MySQL) Full table dump, incremental query (WHERE updated_at > last_run), CDC
REST API Pagination, rate-limit-aware polling
File system (CSV, JSON, Parquet) Direct read, globbing, manifest tracking
Event stream (Kafka, Kinesis) Consumer group offset management
SaaS platforms (Stripe, Salesforce) Official connectors or APIs

Change Data Capture (CDC) deserves a special mention. Rather than querying a database repeatedly, CDC reads the database's transaction log (e.g., Postgres WAL, MySQL binlog) and captures every insert, update, and delete as an event stream. It is efficient, low-latency, and non-intrusive to the source system.

3.2 Transformation

Transformation is where business logic lives. Common operations include:

  • Cleaning: Removing nulls, standardizing date formats, correcting encodings.

  • Filtering: Dropping rows that do not meet criteria.

  • Enrichment: Joining with reference data (e.g., mapping user IDs to demographic attributes).

  • Aggregation: Computing sums, counts, averages over time windows.

  • Type casting: Ensuring columns have the correct data types.

  • Deduplication: Eliminating duplicate records from overlapping extraction windows.

  • Schema normalization: Flattening nested JSON, pivoting wide tables, unnesting arrays.

Transformations can be implemented in Python (Pandas, Polars), SQL (dbt is the dominant tool here), or using the processing engine's native API (Spark, Flink).

3.3 Loading

Loading writes transformed data to the destination. There are several loading strategies:

  • Full refresh: Truncate the destination table and reload from scratch. Simple but expensive for large datasets.

  • Incremental append: Only write new rows. Fast, but does not handle updates or deletes.

  • Upsert (merge): Insert new rows, update existing ones based on a primary key. The most common production pattern.

  • Slowly Changing Dimensions (SCD): A family of patterns for tracking how dimension data changes over time, relevant when historical accuracy of dimension attributes matters.


4. Orchestration

A pipeline step is just code. Orchestration is the system that decides when each step runs, in what order, with what dependencies, and what to do when something fails.

The dominant open-source orchestration tool is Apache Airflow, which models pipelines as Directed Acyclic Graphs (DAGs). Each node is a task; edges define dependencies.

# Minimal Airflow DAG skeleton
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(): ...
def transform(): ...
def load(): ...

with DAG("my_pipeline", start_date=datetime(2025, 1, 1), schedule="@daily") as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3  # dependency chain

Alternatives worth knowing:

Tool Positioning
Apache Airflow Industry standard, battle-tested, verbose
Prefect More Pythonic, dynamic task graphs, strong observability
Dagster Asset-centric model, excellent for data assets with lineage
Temporal General-purpose workflow engine, strong durability guarantees
Luigi Older, simpler, still in use at some organisations

5. Data Quality and Validation

Bad data entering a pipeline is worse than no data at all; it produces silently incorrect outputs that downstream consumers trust. Data quality checks should be treated as first-class pipeline citizens, not afterthoughts.

Schema validation ensures incoming data conforms to an expected structure:

# Using Pydantic for schema validation on ingested records
from pydantic import BaseModel, validator
from typing import Optional
from datetime import datetime

class UserEvent(BaseModel):
    user_id: str
    event_type: str
    timestamp: datetime
    payload: Optional[dict] = None

    @validator("event_type")
    def event_type_must_be_known(cls, v):
        allowed = {"click", "purchase", "login", "logout"}
        if v not in allowed:
            raise ValueError(f"Unknown event type: {v}")
        return v

Statistical validation detects anomalies: a column that is 0% null suddenly showing 40% null; a daily row count that drops by 90%; a revenue column with negative values. Tools like Great Expectations and dbt tests (not_null, unique, accepted_values, referential_integrity) formalize these checks.


6. Error Handling and Observability

Pipelines fail. The question is not whether, but when, and whether you will know about it before your stakeholders do.

6.1 Retry Logic

Transient failures (network timeouts, rate limits, temporary database unavailability) should trigger automatic retries with exponential backoff:

import time
import random

def fetch_with_retry(fetch_fn, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return fetch_fn()
        except TransientError as e:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

6.2 Dead Letter Queues

Records that fail processing after exhausting retries should not be silently dropped. Route them to a Dead Letter Queue (DLQ) – a separate storage location where failed records are held for inspection, reprocessing, or alerting.

6.3 Observability

At minimum, instrument your pipelines with:

  • Row counts at each stage (extract count vs. load count should be reconcilable).

  • Latency metrics for each step.

  • Error rates and failure categorisation.

  • Data freshness – how old is the most recent record in the destination?

Tools like Datadog, Grafana + Prometheus, OpenTelemetry, and purpose-built data observability platforms (Monte Carlo, Bigeye) serve this space.


7. Storage Formats

The format in which data is stored and transmitted has significant performance implications.

Format Characteristics Common Use
CSV Human-readable, no schema, row-oriented Simple file exchange, legacy systems
JSON / JSONL Schema-flexible, verbose, row-oriented API payloads, log data
Parquet Columnar, compressed, schema-embedded Analytical workloads, data lakes
Avro Row-oriented, schema evolution support, compact Kafka message serialisation
ORC Columnar, Hive ecosystem Hadoop-native analytical use
Delta Lake / Iceberg Table formats on top of Parquet, ACID transactions, time travel Lakehouse architectures

For analytical pipelines, Parquet is the de facto standard. For streaming pipelines, Avro with a schema registry (Confluent Schema Registry) is widely adopted.


8. A Simple End-to-End Example

Let us walk through a concrete, minimal pipeline: fetching daily weather data from a public API and loading it into a PostgreSQL database.

import requests
import psycopg2
from datetime import date

# --- EXTRACT ---
def extract(city: str) -> dict:
    url = f"https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": 48.8566,
        "longitude": 2.3522,
        "daily": ["temperature_2m_max", "temperature_2m_min"],
        "timezone": "Europe/Paris",
        "forecast_days": 1
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

# --- TRANSFORM ---
def transform(raw: dict) -> dict:
    daily = raw["daily"]
    return {
        "date": date.fromisoformat(daily["time"][0]),
        "temp_max_c": daily["temperature_2m_max"][0],
        "temp_min_c": daily["temperature_2m_min"][0],
    }

# --- LOAD ---
def load(record: dict, conn_str: str):
    with psycopg2.connect(conn_str) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO weather_daily (date, temp_max_c, temp_min_c)
                VALUES (%(date)s, %(temp_max_c)s, %(temp_min_c)s)
                ON CONFLICT (date) DO UPDATE
                    SET temp_max_c = EXCLUDED.temp_max_c,
                        temp_min_c = EXCLUDED.temp_min_c
            """, record)

# --- ORCHESTRATE ---
if __name__ == "__main__":
    raw = extract("Paris")
    record = transform(raw)
    load(record, "postgresql://user:pass@localhost/mydb")
    print(f"Loaded: {record}")

This pipeline is synchronous, single-threaded, and scheduled via cron – perfectly adequate for low-frequency, low-volume use cases. Scaling it means introducing retries, parallel extraction, schema validation, and an orchestrator.


9. Choosing the Right Stack

There is no universal stack. The right tooling depends on volume, latency requirements, team size, and budget.

For Batch Pipelines

Scale Recommended Stack
Small / prototyping Python scripts + cron + Postgres
Medium Airflow or Prefect + dbt + Snowflake / BigQuery
Large Spark (PySpark) + Airflow + Delta Lake / Iceberg

For Streaming Pipelines

Use Case Recommended Stack
High-throughput event processing Apache Kafka + Flink or Kafka Streams
Moderate latency, simpler ops AWS Kinesis + Lambda or Flink
Python-native streaming Bytewax, Faust

Managed / Cloud-native Options

If operational overhead is a constraint, managed services reduce infrastructure burden significantly: AWS Glue (serverless ETL), Google Dataflow (managed Beam), Azure Data Factory, Fivetran / Airbyte (managed connectors), dbt Cloud.


10. Common Pitfalls

Developers new to data engineering routinely encounter the same failure modes. Knowing them in advance saves significant pain.

1. Not accounting for schema drift. Sources change their schemas without warning. Build schema validation and alerting into your extraction layer from day one.

2. Ignoring time zones. A timestamp stored without a time zone is an ambiguous timestamp. Standardise on UTC at the point of ingestion, always.

3. Treating pipelines as fire-and-forget. Pipelines that silently fail or silently produce wrong data are dangerous. Instrument everything.

4. Over-engineering early. A cron job and a Python script will serve you well until the data volume or latency requirement forces something more complex. Prefer simplicity until you cannot.

5. Not designing for backfill. At some point, you will need to reprocess historical data. Pipelines that cannot be parameterised by date range will require painful rewrites.

6. Mutating source data. Your pipeline should never write back to its source system unless that is an explicit, carefully considered design decision.


Conclusion

Data pipelines are foundational infrastructure. They sit behind every dashboard, every machine learning model, every alerting system, and every analytical query your organization relies on. Understanding them – even at a foundational level – makes you a more effective developer and positions you well to collaborate with or transition into data engineering roles.

The field rewards systems thinking: pipelines are not isolated scripts but components in a larger data ecosystem, and the decisions you make about reliability, observability, and schema management compound over time.

Start simple. Measure everything. Design for failure. And always think about the next person who will have to backfill your pipeline at 2 AM.


Written for NeuralStack | MS · Covering Software Engineering, AI Engineering & Cybersecurity

More from this blog

N

NeuralStack | MS

33 posts

NeuralStack | MS is your authoritative resource at the intersection of modern software development, AI engineering, and security engineering. I provide in-depth technical articles, industry trends, and actionable insights designed to empower developers and engineers in building secure, intelligent, and scalable systems.