The Data Pipeline Crisis
Part 1 of 4 in the series "From Spaghetti to Symphony"
Your Data Pipeline Is a Ticking Time Bomb
It's 3 AM. Your phone buzzes. The executive dashboard is showing zeros, the ML model hasn't updated in 12 hours, and somewhere in your 2,000-line user_processing.py file, a database connection is hanging.
You open your laptop, squinting at the code you wrote six months ago. Or was it your teammate who wrote it? The git blame is an archaeological dig through layers of hotfixes, workarounds, and "temporary" solutions that became permanent.
Sound familiar?
You're not alone. After many years of building data pipelines, I've seen this pattern everywhere: We're not building data infrastructure. We're accumulating technical debt with a database connection.
The Real Cost of Pipeline Spaghetti
Let me show you what I mean. Here's a "simple" Dagster asset:
import pandas as pd
from dagster import asset
from sqlalchemy import create_engine

@asset
def user_data(context):
    try:
        # Where does this connection string come from? Who knows!
        conn = create_engine("postgresql://...")
        df = pd.read_sql("SELECT * FROM users", conn)

        # Business logic? Error handling? It's all here!
        if len(df) == 0:
            context.log.error("No users found!")
            return None  # 💣 What happens downstream?

        # Data cleaning mixed with everything else
        df['email'] = df['email'].str.lower()
        df = df[df['email'].str.contains('@')]

        # Quick, let's save it somewhere!
        df.to_parquet("/data/processed_users.parquet")

        # Oh, and let's add some metrics while we're at it
        context.log.info(f"Processed {len(df)} users")
        send_metric_to_datadog("users.processed", len(df))

        return df
    except Exception as e:
        context.log.error(f"Failed: {e}")
        # Should we retry? Alert someone? ¯\_(ツ)_/¯
        raise

@asset
def user_metrics(context, user_data):
    # Pray that user_data isn't None 🙏
    # Hope the schema hasn't changed 🤞
    # Assume the parquet file exists 🎲
    return calculate_metrics(user_data)
This isn't bad code written by bad engineers. This is what happens when we try to solve distributed systems problems with scripting patterns. It's what happens when our mental model doesn't match our problem space.
The Symptoms You're Probably Ignoring
If your pipelines look like this, you're experiencing:
- The 3 AM Debugging Session: One asset fails, and you need to trace through five files to understand why
- The New Hire Nightmare: "Just read through the codebase" turns into a three-week archaeological expedition
- The Hotfix Habit: Every bug fix is urgent, nothing is properly tested, and the fixes often cause new bugs
- The Refactor Fantasy: "We'll clean this up next quarter" (Narrator: They never did)
- The Scale Wall: Adding new data sources takes weeks, not hours
But here's the thing: Dagster already has the tools to solve this. We're just not using them right.
An Unexpected Wikipedia Rabbit Hole
Last month, I was supposed to be debugging a failed pipeline. Instead, I fell down a Wikipedia rabbit hole about functional programming. (We've all been there, right?)
I landed on an article about monads. If you don't know what monads are, don't worry - I barely understood them either. The Wikipedia description made them sound like abstract nonsense:
"A monad is a structure that combines program fragments and wraps their return values in a type with additional computation."
Yeah, thanks Wikipedia. Super helpful. 🙄
But then I saw an example in Haskell, and something clicked:
-- Haskell example (don't panic, we're not writing Haskell)
getData >>= processData >>= validateData >>= saveData
Each step happens only if the previous one succeeds. Errors propagate automatically. Context flows through the chain. Side effects are isolated and explicit.
Wait a minute... This is exactly what we want our data pipelines to do.
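If Haskell makes your eyes glaze over, here's the same idea as a toy Python sketch. This Result type is something I'm inventing purely for illustration (it's not a Dagster or stdlib API) - the point is that bind only runs the next step if nothing has failed yet:

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Result:
    value: Optional[object] = None
    error: Optional[str] = None

    def bind(self, step: Callable[[object], "Result"]) -> "Result":
        # If an earlier step failed, skip this one and carry the error forward
        return self if self.error else step(self.value)

def get_data() -> Result:
    return Result(value=[1, 2, 3])

def process_data(rows: list) -> Result:
    return Result(value=[r * 2 for r in rows]) if rows else Result(error="no rows")

# Mirrors getData >>= processData: success flows through, errors short-circuit
print(get_data().bind(process_data))               # Result(value=[2, 4, 6], error=None)
print(Result(error="db down").bind(process_data))  # Result(value=None, error='db down')

Notice what we never wrote: a try/except, a None check, or an if-statement between the steps. The chaining structure handles all of it.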
The Aha! Moment That Changed Everything
Here's what hit me like a ton of bricks:
Dagster assets are trying to be monads. They're solving the exact same problem:
- Composing operations while managing context
- Handling errors in a predictable way
- Making side effects explicit
- Building complex flows from simple pieces
We've been writing procedural scripts inside a functional framework. No wonder it feels awkward!
What This Actually Looks Like
Let me show you the same pipeline, rebuilt with this insight:
import pandas as pd
from dagster import AssetExecutionContext, AutoMaterializePolicy, asset

# Pure business logic - no I/O, no side effects
def clean_user_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Pure function: data in, data out"""
    return (df
            .assign(email=lambda x: x['email'].str.lower())
            .query("email.str.contains('@')", engine="python")  # .str methods need the python engine
            )

# Asset handles context and composition
@asset(
    group_name="user_analytics",
    auto_materialize_policy=AutoMaterializePolicy.eager()
)
def clean_users(context: AssetExecutionContext, raw_users: pd.DataFrame) -> pd.DataFrame:
    """Asset manages context, function manages logic"""
    # Explicit error handling
    if raw_users.empty:
        raise ValueError("No users to process")

    # Pure transformation
    cleaned = clean_user_emails(raw_users)

    # Rich metadata for observability
    context.add_output_metadata({
        "row_count": len(cleaned),
        "null_emails": int(cleaned['email'].isna().sum()),
        "cleaning_ratio": len(cleaned) / len(raw_users) * 100
    })
    return cleaned

# Downstream assets are simple and confident
@asset(group_name="user_analytics")
def user_metrics(clean_users: pd.DataFrame) -> pd.DataFrame:
    """No defensive programming needed - dependencies are explicit"""
    return calculate_metrics(clean_users)
Look at the difference:
- Business logic is separate from infrastructure - clean_user_emails can be unit tested without Dagster (see the test sketch after this list)
- Dependencies are explicit - user_metrics knows exactly what it's getting
- Context is managed by the framework - we're not manually handling connections and logging
- Errors have clear boundaries - Each asset succeeds or fails as a unit
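Here's what that first point buys you - a minimal pytest sketch, where the my_pipeline.transforms import path is a placeholder for wherever your pure functions live:

import pandas as pd
from my_pipeline.transforms import clean_user_emails  # placeholder module path

def test_clean_user_emails_lowercases_and_filters():
    raw = pd.DataFrame({"email": ["Alice@Example.COM", "not-an-email", "bob@test.io"]})
    cleaned = clean_user_emails(raw)
    # Mixed case gets normalized; the junk row gets dropped
    assert cleaned["email"].tolist() == ["alice@example.com", "bob@test.io"]

No Dagster, no database, no mocks. Just data in, assertions out.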
The Pattern Hidden in Plain Sight
I started recognizing that the problems functional programming solves are the same problems we face in data engineering:
- Composition: How do we build complex pipelines from simple pieces?
- Context: How do we manage connections, credentials, and configuration without coupling?
- Errors: How do we handle failures gracefully and predictably?
- Side Effects: How do we make I/O and external calls explicit and testable?
Modern Dagster (with features like ConfigurableResource, asset checks, and auto-materialization) gives us the tools. We just need to change how we think about using them.
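To make that concrete, here's a rough sketch of two of those tools working together - a ConfigurableResource that owns the connection string our first example hardcoded, plus an asset check guarding the cleaned data. WarehouseResource and WAREHOUSE_URL are placeholder names of mine, and clean_users is the asset from the example above:

import pandas as pd
from sqlalchemy import create_engine
from dagster import (
    AssetCheckResult,
    ConfigurableResource,
    Definitions,
    EnvVar,
    asset,
    asset_check,
)

class WarehouseResource(ConfigurableResource):
    """Owns connection config so assets don't have to."""
    connection_string: str

    def read_table(self, table: str) -> pd.DataFrame:
        return pd.read_sql(f"SELECT * FROM {table}", create_engine(self.connection_string))

@asset
def raw_users(warehouse: WarehouseResource) -> pd.DataFrame:
    # The asset asks for data; the resource knows how to fetch it
    return warehouse.read_table("users")

@asset_check(asset=clean_users)
def emails_contain_at_sign(clean_users: pd.DataFrame) -> AssetCheckResult:
    # Fails loudly instead of letting bad rows leak downstream
    invalid = int((~clean_users["email"].str.contains("@", na=False)).sum())
    return AssetCheckResult(passed=invalid == 0, metadata={"invalid_rows": invalid})

defs = Definitions(
    assets=[raw_users, clean_users],
    asset_checks=[emails_contain_at_sign],
    resources={"warehouse": WarehouseResource(connection_string=EnvVar("WAREHOUSE_URL"))},
)

The connection string lives in one place, gets injected from the environment, and any asset that needs the warehouse just declares it as a parameter.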
Why This Matters More Than You Think
This isn't just about cleaner code. When you start thinking about your pipelines this way:
- Testing becomes trivial because business logic is pure functions
- Debugging gets 10x faster because each piece has a single responsibility
- New team members onboard in days, not weeks because the patterns are consistent
- Scaling becomes linear, not exponential because composition is built-in
- 3 AM alerts become rare because error handling is explicit and predictable
The Journey Ahead
This was just the beginning of my journey from spaghetti pipelines to composable data infrastructure. Over the next three posts, I'll show you:
- Part 2: The three fundamental patterns that make pipelines composable (with real code you can use today)
- Part 3: How to build your own pipeline abstractions using these patterns
- Part 4: Integration with dbt, scaling with your team, and the bigger picture
But here's the best part: You don't need to understand monads to use these patterns. You just need to recognize that the solution to pipeline spaghetti isn't more scripts - it's better composition.
Your Next Step
Look at one of your most complex pipelines. Can you identify where:
- Business logic is tangled with infrastructure?
- Error handling is implicit or missing?
- Dependencies are unclear?
- Testing would be nearly impossible?
That's where you start. Pick one asset, separate the pure logic from the context management, and see what happens.
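If you want a template for that first move, here's the refactor in miniature - order_totals and load_orders are hypothetical stand-ins for whatever asset and I/O you actually have:

import pandas as pd
from dagster import asset

def load_orders() -> pd.DataFrame:
    # Stand-in for whatever I/O you already have
    return pd.read_parquet("/data/orders.parquet")

# Before: business logic buried inside the asset, untestable without Dagster
@asset
def order_totals_before(context) -> pd.DataFrame:
    df = load_orders()
    df["total"] = df["qty"] * df["price"]
    return df

# After: the logic is a pure function; the asset just composes
def add_order_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Pure function: data in, data out."""
    return df.assign(total=df["qty"] * df["price"])

@asset
def order_totals(context) -> pd.DataFrame:
    return add_order_totals(load_orders())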
Trust me, once you see your pipelines this way, you can't unsee it. And your 3 AM self will thank you.
Next up in Part 2: "The Three Pillars of Composable Data Pipelines" - where we dive deep into the specific patterns that transform spaghetti into symphony.
not made by a 🤖