A Beginner's Guide to Data Pipelines for Developers

Published on NeuralStack | MS · Software Engineering · Data Engineering
Introduction
Every modern application generates data. User events, logs, transactions, sensor readings, and API responses: the volume is constant and the pace is relentless. The challenge is not collecting that data; it is moving it reliably from where it originates to where it needs to be, in the right shape, at the right time.
That is precisely what a data pipeline does.
If you are a developer who has written application code but never formally worked in data engineering, this guide is for you. I will cover what data pipelines are, how they are structured, the vocabulary you need to navigate the field, and the architectural decisions you will encounter early on.
1. What Is a Data Pipeline?
A data pipeline is an automated sequence of steps that ingests data from one or more sources, transforms it, and delivers it to a destination, typically a database, a data warehouse, a message queue, or another service.
Think of it as an assembly line. Raw materials (raw data) enter at one end, pass through a series of stations (transformations), and exit as a finished product (clean, structured, usable data).
A minimal pipeline has three stages:
[Source] → [Transform] → [Destination]
In practice, pipelines can be far more complex: multiple sources, fan-out delivery to multiple destinations, conditional branching, error-handling paths, schema validation, deduplication, and enrichment steps, all stitched together and orchestrated on a schedule or in real time.
2. Core Concepts and Vocabulary
Before going further, it is worth establishing a shared vocabulary. Data engineering has its own lexicon, and misusing these terms creates confusion quickly.
2.1 ETL and ELT
ETL (Extract, Transform, Load) is the classic pipeline pattern:
Extract – pull data from the source system.
Transform – clean, reshape, and enrich it.
Load – write the result to the destination.
ELT (Extract, Load, Transform) reverses the last two steps. Raw data is loaded into the destination first, then transformed in place, a pattern made practical by the rise of cloud data warehouses like BigQuery, Snowflake, and Redshift, which can handle heavy computation at scale.
The choice between ETL and ELT is an architectural one. ETL keeps raw data out of the warehouse and enforces data quality upstream. ELT gives analysts and data scientists access to raw data faster, at the cost of potentially messier intermediate states.
2.2 Batch vs. Streaming
Batch pipelines process data in chunks at scheduled intervals – every hour, every night, every Monday morning. They are simpler to build, easier to debug, and well-suited to use cases where latency is not critical (nightly reporting, weekly aggregations, ML training jobs).
Streaming pipelines process data continuously, event by event, with latency measured in milliseconds to seconds. They are essential for use cases that require real-time responses: fraud detection, live dashboards, recommendation systems, and alerting infrastructure.
The choice is driven by the latency requirement of the downstream consumer, not by the volume of data.
2.3 Sources and Sinks
Source: Where data originates – a relational database, an external API, a file system, an event stream, a SaaS platform, a IoT sensor.
Sink: Where processed data is delivered – a data warehouse, a NoSQL store, a search index, a message queue, another service.
A single pipeline can have multiple sources and multiple sinks.
2.4 Idempotency
A pipeline step is idempotent if running it multiple times with the same input produces the same result without side effects. This is a critical property. Networks fail. Processes crash. Pipelines retry. If your transformations or load steps are not idempotent, retries will corrupt your data.
2.5 Backfilling
Backfilling is the process of re-running a pipeline over historical data, typically after fixing a bug, adding a new column, or changing business logic. Good pipeline design accounts for backfilling from the start, because the need for it is almost inevitable.
3. Anatomy of a Pipeline
Let us break down the three canonical stages in more detail.
3.1 Extraction
Extraction means reading data from a source. The technical approach depends on the source type:
| Source Type | Common Extraction Method |
|---|---|
| Relational DB (Postgres, MySQL) | Full table dump, incremental query (WHERE updated_at > last_run), CDC |
| REST API | Pagination, rate-limit-aware polling |
| File system (CSV, JSON, Parquet) | Direct read, globbing, manifest tracking |
| Event stream (Kafka, Kinesis) | Consumer group offset management |
| SaaS platforms (Stripe, Salesforce) | Official connectors or APIs |
Change Data Capture (CDC) deserves a special mention. Rather than querying a database repeatedly, CDC reads the database's transaction log (e.g., Postgres WAL, MySQL binlog) and captures every insert, update, and delete as an event stream. It is efficient, low-latency, and non-intrusive to the source system.
3.2 Transformation
Transformation is where business logic lives. Common operations include:
Cleaning: Removing nulls, standardizing date formats, correcting encodings.
Filtering: Dropping rows that do not meet criteria.
Enrichment: Joining with reference data (e.g., mapping user IDs to demographic attributes).
Aggregation: Computing sums, counts, averages over time windows.
Type casting: Ensuring columns have the correct data types.
Deduplication: Eliminating duplicate records from overlapping extraction windows.
Schema normalization: Flattening nested JSON, pivoting wide tables, unnesting arrays.
Transformations can be implemented in Python (Pandas, Polars), SQL (dbt is the dominant tool here), or using the processing engine's native API (Spark, Flink).
3.3 Loading
Loading writes transformed data to the destination. There are several loading strategies:
Full refresh: Truncate the destination table and reload from scratch. Simple but expensive for large datasets.
Incremental append: Only write new rows. Fast, but does not handle updates or deletes.
Upsert (merge): Insert new rows, update existing ones based on a primary key. The most common production pattern.
Slowly Changing Dimensions (SCD): A family of patterns for tracking how dimension data changes over time, relevant when historical accuracy of dimension attributes matters.
4. Orchestration
A pipeline step is just code. Orchestration is the system that decides when each step runs, in what order, with what dependencies, and what to do when something fails.
The dominant open-source orchestration tool is Apache Airflow, which models pipelines as Directed Acyclic Graphs (DAGs). Each node is a task; edges define dependencies.
# Minimal Airflow DAG skeleton
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def extract(): ...
def transform(): ...
def load(): ...
with DAG("my_pipeline", start_date=datetime(2025, 1, 1), schedule="@daily") as dag:
t1 = PythonOperator(task_id="extract", python_callable=extract)
t2 = PythonOperator(task_id="transform", python_callable=transform)
t3 = PythonOperator(task_id="load", python_callable=load)
t1 >> t2 >> t3 # dependency chain
Alternatives worth knowing:
| Tool | Positioning |
|---|---|
| Apache Airflow | Industry standard, battle-tested, verbose |
| Prefect | More Pythonic, dynamic task graphs, strong observability |
| Dagster | Asset-centric model, excellent for data assets with lineage |
| Temporal | General-purpose workflow engine, strong durability guarantees |
| Luigi | Older, simpler, still in use at some organisations |
5. Data Quality and Validation
Bad data entering a pipeline is worse than no data at all; it produces silently incorrect outputs that downstream consumers trust. Data quality checks should be treated as first-class pipeline citizens, not afterthoughts.
Schema validation ensures incoming data conforms to an expected structure:
# Using Pydantic for schema validation on ingested records
from pydantic import BaseModel, validator
from typing import Optional
from datetime import datetime
class UserEvent(BaseModel):
user_id: str
event_type: str
timestamp: datetime
payload: Optional[dict] = None
@validator("event_type")
def event_type_must_be_known(cls, v):
allowed = {"click", "purchase", "login", "logout"}
if v not in allowed:
raise ValueError(f"Unknown event type: {v}")
return v
Statistical validation detects anomalies: a column that is 0% null suddenly showing 40% null; a daily row count that drops by 90%; a revenue column with negative values. Tools like Great Expectations and dbt tests (not_null, unique, accepted_values, referential_integrity) formalize these checks.
6. Error Handling and Observability
Pipelines fail. The question is not whether, but when, and whether you will know about it before your stakeholders do.
6.1 Retry Logic
Transient failures (network timeouts, rate limits, temporary database unavailability) should trigger automatic retries with exponential backoff:
import time
import random
def fetch_with_retry(fetch_fn, max_attempts=5, base_delay=1.0):
for attempt in range(max_attempts):
try:
return fetch_fn()
except TransientError as e:
if attempt == max_attempts - 1:
raise
delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
time.sleep(delay)
6.2 Dead Letter Queues
Records that fail processing after exhausting retries should not be silently dropped. Route them to a Dead Letter Queue (DLQ) – a separate storage location where failed records are held for inspection, reprocessing, or alerting.
6.3 Observability
At minimum, instrument your pipelines with:
Row counts at each stage (extract count vs. load count should be reconcilable).
Latency metrics for each step.
Error rates and failure categorisation.
Data freshness – how old is the most recent record in the destination?
Tools like Datadog, Grafana + Prometheus, OpenTelemetry, and purpose-built data observability platforms (Monte Carlo, Bigeye) serve this space.
7. Storage Formats
The format in which data is stored and transmitted has significant performance implications.
| Format | Characteristics | Common Use |
|---|---|---|
| CSV | Human-readable, no schema, row-oriented | Simple file exchange, legacy systems |
| JSON / JSONL | Schema-flexible, verbose, row-oriented | API payloads, log data |
| Parquet | Columnar, compressed, schema-embedded | Analytical workloads, data lakes |
| Avro | Row-oriented, schema evolution support, compact | Kafka message serialisation |
| ORC | Columnar, Hive ecosystem | Hadoop-native analytical use |
| Delta Lake / Iceberg | Table formats on top of Parquet, ACID transactions, time travel | Lakehouse architectures |
For analytical pipelines, Parquet is the de facto standard. For streaming pipelines, Avro with a schema registry (Confluent Schema Registry) is widely adopted.
8. A Simple End-to-End Example
Let us walk through a concrete, minimal pipeline: fetching daily weather data from a public API and loading it into a PostgreSQL database.
import requests
import psycopg2
from datetime import date
# --- EXTRACT ---
def extract(city: str) -> dict:
url = f"https://api.open-meteo.com/v1/forecast"
params = {
"latitude": 48.8566,
"longitude": 2.3522,
"daily": ["temperature_2m_max", "temperature_2m_min"],
"timezone": "Europe/Paris",
"forecast_days": 1
}
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
return response.json()
# --- TRANSFORM ---
def transform(raw: dict) -> dict:
daily = raw["daily"]
return {
"date": date.fromisoformat(daily["time"][0]),
"temp_max_c": daily["temperature_2m_max"][0],
"temp_min_c": daily["temperature_2m_min"][0],
}
# --- LOAD ---
def load(record: dict, conn_str: str):
with psycopg2.connect(conn_str) as conn:
with conn.cursor() as cur:
cur.execute("""
INSERT INTO weather_daily (date, temp_max_c, temp_min_c)
VALUES (%(date)s, %(temp_max_c)s, %(temp_min_c)s)
ON CONFLICT (date) DO UPDATE
SET temp_max_c = EXCLUDED.temp_max_c,
temp_min_c = EXCLUDED.temp_min_c
""", record)
# --- ORCHESTRATE ---
if __name__ == "__main__":
raw = extract("Paris")
record = transform(raw)
load(record, "postgresql://user:pass@localhost/mydb")
print(f"Loaded: {record}")
This pipeline is synchronous, single-threaded, and scheduled via cron – perfectly adequate for low-frequency, low-volume use cases. Scaling it means introducing retries, parallel extraction, schema validation, and an orchestrator.
9. Choosing the Right Stack
There is no universal stack. The right tooling depends on volume, latency requirements, team size, and budget.
For Batch Pipelines
| Scale | Recommended Stack |
|---|---|
| Small / prototyping | Python scripts + cron + Postgres |
| Medium | Airflow or Prefect + dbt + Snowflake / BigQuery |
| Large | Spark (PySpark) + Airflow + Delta Lake / Iceberg |
For Streaming Pipelines
| Use Case | Recommended Stack |
|---|---|
| High-throughput event processing | Apache Kafka + Flink or Kafka Streams |
| Moderate latency, simpler ops | AWS Kinesis + Lambda or Flink |
| Python-native streaming | Bytewax, Faust |
Managed / Cloud-native Options
If operational overhead is a constraint, managed services reduce infrastructure burden significantly: AWS Glue (serverless ETL), Google Dataflow (managed Beam), Azure Data Factory, Fivetran / Airbyte (managed connectors), dbt Cloud.
10. Common Pitfalls
Developers new to data engineering routinely encounter the same failure modes. Knowing them in advance saves significant pain.
1. Not accounting for schema drift. Sources change their schemas without warning. Build schema validation and alerting into your extraction layer from day one.
2. Ignoring time zones. A timestamp stored without a time zone is an ambiguous timestamp. Standardise on UTC at the point of ingestion, always.
3. Treating pipelines as fire-and-forget. Pipelines that silently fail or silently produce wrong data are dangerous. Instrument everything.
4. Over-engineering early. A cron job and a Python script will serve you well until the data volume or latency requirement forces something more complex. Prefer simplicity until you cannot.
5. Not designing for backfill. At some point, you will need to reprocess historical data. Pipelines that cannot be parameterised by date range will require painful rewrites.
6. Mutating source data. Your pipeline should never write back to its source system unless that is an explicit, carefully considered design decision.
Conclusion
Data pipelines are foundational infrastructure. They sit behind every dashboard, every machine learning model, every alerting system, and every analytical query your organization relies on. Understanding them – even at a foundational level – makes you a more effective developer and positions you well to collaborate with or transition into data engineering roles.
The field rewards systems thinking: pipelines are not isolated scripts but components in a larger data ecosystem, and the decisions you make about reliability, observability, and schema management compound over time.
Start simple. Measure everything. Design for failure. And always think about the next person who will have to backfill your pipeline at 2 AM.
Written for NeuralStack | MS · Covering Software Engineering, AI Engineering & Cybersecurity





