My Database is My Application: Rethinking Webhook Logic with DuckDB and SQL
Imagine you need to build a system for processing incoming webhooks. You're probably picturing a familiar setup: a lightweight web server (FastAPI, Flask, Express.js, etc.), some Python (or Node.js, or Go) handlers to parse JSON, a sprinkle of business logic, and then maybe persisting data to a traditional database like PostgreSQL or MySQL. Perhaps you'd toss events onto a message queue like Kafka or RabbitMQ for downstream processing. Standard stuff, right?
Well, I’ve been experimenting with a different approach. What if I told you I let SQL handle almost all of it?
I load the incoming webhook JSON directly into DuckDB. Then, I run a SQL transform—a query dynamically defined and stored in the database itself—to reshape the data. And yes, SQL even helps decide where that webhook payload gets routed next.
Sounds a bit... unconventional? Maybe. But it’s an attempt to solve some persistent challenges I've encountered, and it opens up a fascinating way to think about data, logic, and infrastructure. This isn't just a backend API; the project also includes a simple web UI to manage these configurations visually, making the whole system tangible. Code available here.
The Familiar Friction: Why I Started Questioning "Normal"
I've built and maintained my fair share of webhook gateways and integration layers. A few common pain points kept cropping up:
- The Code Bottleneck: Every time a new webhook source or a slight variation in transformation logic was needed, it meant code changes. A new handler, a modified Pydantic model, a redeploy. I, or my team, became the bottleneck.
- Ownership Tangles: Giving multiple teams or users the ability to define their own webhook transformations often meant granting them broader application deployment privileges, or setting up complex, isolated microservices for each. Neither felt quite right.
- Repetitive Logic: So many webhook handlers do similar things: pick a few fields, rename some keys, maybe enrich with a static lookup. Writing Python for output_payload['userName'] = input_payload['user']['login'] over and over felt like I was missing a more declarative way.
- Observability Challenges: Understanding why a specific webhook failed or was transformed in a certain way often involved digging through application logs, which could be scattered or inconsistently formatted.
These issues led me to wonder: could there be a more data-centric, self-service approach?
Shifting the Lens: DuckDB as a Runtime Engine
This is where my mental model for DuckDB started to shift.
I began to see DuckDB not just as an embedded analytical database, but as an in-process SQL engine capable of acting as a programmable substrate for data transformation at runtime.
Think about it:
- We trust SQL to define data contracts and schemas (hello, dbt users!).
- We trust SQL to express complex aggregations, filters, and joins for analytics.
- SQL is declarative, its behavior is (usually) inspectable, and its syntax is widely understood.
So, why not trust SQL to express the live transformation and routing logic for something like webhooks? The key insight for me was this: if the logic is defined as data (SQL queries stored in a table), and the engine to execute that logic is also data-aware (DuckDB), then you unlock a powerful new paradigm.
The Core Idea: The Database is the Application Logic
In the system I've built, the database isn't just a passive recipient of data; it's the active processing pipeline.
Incoming webhooks become temporary views (connection.register), routing rules are WHERE clauses (filter_query), transformations are SELECT expressions (transform_query), and audit logs are just regular tables (raw_events, transformed_events).
The database isn't just a data sink; it's the pipeline itself. It’s almost like turning your middleware inside out.
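To make that concrete, here is a minimal sketch of the idea in plain Python and DuckDB. It is not the gateway's actual code (the real system registers the payload as a temporary view via connection.register, logs every event, and handles concurrency); the staging table and function names are just for illustration.

# Minimal sketch, not the gateway's actual code: stage the incoming JSON where
# SQL can see it, then apply the stored filter_query and transform_query.
import json
import duckdb

con = duckdb.connect("gateway.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS incoming_payload (payload JSON)")

def process(payload: dict, filter_query: str, transform_query: str):
    # Stage the webhook body as a one-row, one-column relation.
    con.execute("DELETE FROM incoming_payload")
    con.execute("INSERT INTO incoming_payload VALUES (?::JSON)", [json.dumps(payload)])

    # Routing rule: filter_query is a WHERE clause evaluated against the payload.
    matched = con.execute(
        f"SELECT count(*) FROM incoming_payload WHERE {filter_query}"
    ).fetchone()[0]
    if not matched:
        return None  # filtered out, nothing to forward

    # Transformation: transform_query is a SELECT over the same row.
    sql = transform_query.replace("{{payload}}", "incoming_payload")
    row = con.execute(sql).fetchone()
    columns = [col[0] for col in con.description]
    return dict(zip(columns, row))

# Example: filter on the branch, reshape the payload. JSON operators are used
# here because this sketch stages the payload as a JSON column.
result = process(
    {"ref": "refs/heads/main", "repository": {"full_name": "my-org/my-repo"}},
    filter_query="payload->>'ref' = 'refs/heads/main'",
    transform_query="SELECT payload->'repository'->>'full_name' AS repo FROM {{payload}}",
)
print(result)  # {'repo': 'my-org/my-repo'}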
Why This "Weird" Approach is Powerful
This might seem like an academic exercise, but it unlocks some genuinely compelling properties.
✅ Declarative Self-Service: Teams can define and update their webhook transformations and filters by submitting SQL queries via an API (or the provided UI!). A full set of API endpoints allows managing webhooks (register, update, list, activate/deactivate, delete), reference tables, and UDFs.
No application redeploys are needed for logic changes. They own their logic.
✅ Queryable Runtime State: Every raw event, every transformation attempt, its success or failure, and the final payload are logged within DuckDB.
Need to debug why a webhook for /github-events isn't working? The /query endpoint or direct DB access lets you run SQL like:
SELECT r.timestamp, r.payload, t.success, t.response_code, t.response_body
FROM raw_events r
LEFT JOIN transformed_events t ON r.id = t.raw_event_id
WHERE r.source_path = '/github-events'
ORDER BY r.timestamp DESC LIMIT 5;
This level of direct, SQL-based introspection into the runtime behavior is incredibly powerful. Specific API endpoints also exist to view recent events and drill down into individual event details.
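To run ad-hoc SQL like the query above programmatically, you can hit the /query endpoint. Here is a sketch in Python; I'm assuming the endpoint accepts a JSON body with a single sql field, so check the repo for the exact request shape.

# Sketch: running the audit query through the /query endpoint.
# Assumes a JSON body like {"sql": "..."}; the actual request shape may differ.
import requests

AUDIT_SQL = """
SELECT r.timestamp, r.payload, t.success, t.response_code, t.response_body
FROM raw_events r
LEFT JOIN transformed_events t ON r.id = t.raw_event_id
WHERE r.source_path = '/github-events'
ORDER BY r.timestamp DESC LIMIT 5
"""

resp = requests.post(
    "http://localhost:8000/query",
    headers={"X-API-Key": "default_key"},
    json={"sql": AUDIT_SQL},
)
print(resp.json())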
✅ Composable & Extensible Logic:
Need to enrich webhook data with user details? Upload a users.csv or JSON file as a reference table (/upload_table) and JOIN it in your transform query using its namespaced name (e.g., ref_webhook123_users).
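For example, once a users.csv has been uploaded for a webhook, its transform could join against it like this. The SQL lives as data, just like any other transform_query; the email and team columns here are invented for illustration.

# Illustrative transform_query (stored as data, like any other) that enriches
# the payload with an uploaded reference table. The u.email and u.team columns
# are made up for this example.
ENRICHING_TRANSFORM = """
SELECT
    payload.sender.login AS github_login,
    u.email              AS email,
    u.team               AS team
FROM {{payload}}
LEFT JOIN ref_webhook123_users AS u
    ON u.github_login = payload.sender.login
"""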
Need a complex string manipulation or a call to an external validation service that SQL can't easily do? Register a Python UDF (/register_udf) and call it directly from your SQL using its namespaced name (e.g., udf_webhook123_extract_jira_key).
# Example Python UDF registered with the system
def extract_jira_key(text: str) -> str | None:  # Type hints help DuckDB!
    """Return the first JIRA-style issue key (e.g. PROJ-123) found in text, if any."""
    import re
    if not text:
        return None
    match = re.search(r"[A-Z]+-\d+", text)
    return match.group(0) if match else None
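The /register_udf endpoint handles the wiring, but conceptually the registration boils down to DuckDB's Python UDF API. A sketch of that step, using the extract_jira_key defined above (the gateway's actual registration code may differ):

# Sketch: attaching the function above to the DuckDB connection under its
# namespaced name. Not the gateway's exact code.
import duckdb
from duckdb.typing import VARCHAR

con = duckdb.connect("gateway.duckdb")
# Explicit types are passed for clarity; with full type hints DuckDB can
# usually infer the parameter and return types on its own.
con.create_function("udf_webhook123_extract_jira_key", extract_jira_key, [VARCHAR], VARCHAR)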
-- Using the UDF in a transform query
SELECT
upper(payload.repository.name) AS repo_name,
udf_webhook123_extract_jira_key(payload.commit.message) AS jira_issue
FROM {{payload}};
✅ Portable Artifact & Simplified Operations: The entire state of the gateway - configurations, reference data, UDF definitions, and logs - can live inside a single DuckDB file (the path is configurable via the DUCKDB_PATH environment variable). Backups are simple file copies.
✅ "Infra as Data": The behavior of the system is defined by data (SQL queries, reference tables) stored within the database, rather than by imperative code deployed in an application layer.
✅ Robust Implementation: Under the hood, it uses standard Python async capabilities (asyncio, FastAPI) along with thread pools and locking (ThreadPoolExecutor, asyncio.Lock) to handle requests concurrently while ensuring safe access to the DuckDB database. Database operations are wrapped in transactions for atomicity.
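As a rough illustration of that pattern (a simplified sketch, not the project's actual code), the shape is roughly:

# Simplified sketch of the concurrency pattern described above: async handlers
# push DuckDB work onto a thread pool, and a lock serializes access to the
# single connection. Not the project's actual code.
import asyncio
from concurrent.futures import ThreadPoolExecutor
import duckdb

MAX_WORKERS = 4  # the real gateway reads this from DUCKDB_MAX_WORKERS
executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)
db_lock = asyncio.Lock()
con = duckdb.connect("gateway.duckdb")

def _run_in_thread(sql: str):
    # Runs on a worker thread; the transaction keeps the operation atomic.
    con.execute("BEGIN TRANSACTION")
    try:
        rows = con.execute(sql).fetchall()
        con.execute("COMMIT")
        return rows
    except Exception:
        con.execute("ROLLBACK")
        raise

async def run_sql(sql: str):
    # Called from async handlers; never blocks the event loop on DuckDB work.
    loop = asyncio.get_running_loop()
    async with db_lock:  # one DuckDB operation at a time
        return await loop.run_in_executor(executor, _run_in_thread, sql)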
Addressing the Skepticism: "But Why Not Just... Python?"
I can hear the questions already: "This sounds overengineered!" or "Why not just write a normal Python app with a config file?" or "Isn't SQL for, you know, querying?"
And those are valid points if your primary goal is just to write a simple webhook handler quickly. But this approach isn't trying to optimize for the simplest possible "hello world" webhook. It's aiming to solve a deeper set of problems around making runtime behavior highly configurable via data, and shifting who owns the transformation logic. It's about enabling a more federated, self-service model.
A Concrete Example: Let's Walk Through It
Imagine we want to process GitHub push events for the main branch only. Here's how it works:
Register the Webhook
We POST a configuration to the /register endpoint (or use the UI!). This definition includes the source_path, the target destination_url, the SQL transform_query, an optional SQL filter_query, and an owner.
curl -X POST http://localhost:8000/register \
  -H "X-API-Key: default_key" \
  -H "Content-Type: application/json" \
  -d '{
    "source_path": "/github-events",
    "destination_url": "https://example.com/webhook-handler",
    "transform_query": "SELECT payload.repository.full_name AS repo, payload.pusher.name AS pusher_name, len(payload.commits) AS commit_count FROM {{payload}}",
    "filter_query": "payload.ref = '\''refs/heads/main'\''",
    "owner": "patricktrainer"
  }'
This tells the gateway: "Listen on /github-events. If an event is for the main branch (filter_query), transform it using this SQL (transform_query), and send it to the configured destination_url."
Receive a Payload
GitHub (or another service) sends a webhook POST request to http://localhost:8000/github-events with a JSON body:
{
"ref": "refs/heads/main",
"repository": { "full_name": "my-org/my-repo" },
"pusher": { "name": "patricktrainer" },
"commits": [
{ "id": "abc", "message": "feat: new feature" },
{ "id": "def", "message": "fix: bug fix" }
]
}
The gateway logs this raw event immediately.
SQL Does the Work
In the background, the gateway processes the event:
- Filtering: It runs the filter_query against the payload (represented as a temporary view accessible via {{payload}}). Since payload.ref is 'refs/heads/main', the filter passes.
- Transformation: It executes the transform_query:
SELECT
payload.repository.full_name AS repo,
payload.pusher.name AS pusher_name,
len(payload.commits) AS commit_count -- DuckDB counts the commits list
FROM {{payload}} -- payload is replaced by the temp view name
This SQL, running inside DuckDB, produces the desired output JSON:
{
"repo": "my-org/my-repo",
"pusher_name": "patricktrainer",
"commit_count": 2
}
Forwarding & Auditing
The gateway sends the transformed payload to the configured destination_url (https://example.com/webhook-handler in this case). It then logs the outcome (success/failure, response code, response body) in the transformed_events table, linking it back to the original raw_events entry.
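Conceptually, that last step is just an HTTP POST plus an INSERT. A rough sketch of it (httpx is my assumption for the client, and the transformed_events columns are simplified to the ones mentioned earlier):

# Rough sketch of the forward-and-audit step; httpx and the exact
# transformed_events schema are assumptions, not the gateway's real code.
import duckdb
import httpx

con = duckdb.connect("gateway.duckdb")

async def forward_and_log(raw_event_id: int, destination_url: str, transformed: dict):
    async with httpx.AsyncClient() as client:
        resp = await client.post(destination_url, json=transformed)
    con.execute(
        """
        INSERT INTO transformed_events (raw_event_id, success, response_code, response_body)
        VALUES (?, ?, ?, ?)
        """,
        [raw_event_id, resp.status_code < 400, resp.status_code, resp.text],
    )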
This simple flow, driven entirely by SQL configurations stored in the database, demonstrates the core power.
Where Could This Go? Open Possibilities
Is this specific implementation "production-ready" for every conceivable use case? Maybe, maybe not, though features like configurable workers (DUCKDB_MAX_WORKERS) and robust DB handling move it closer.
It's primarily an exploration. But it’s an exciting one because it forces us to invert the typical application stack.
I see this pattern as a prototype for thinking differently about:
- Data-driven control planes: Where system behavior is dynamically configured through data.
- Self-service data pipelines: Empowering users to build their own simple ETLs with SQL, reference data, and custom Python logic.
- Extensible audit layers: Creating rich, queryable logs of system activity by default.
- Dynamic caching layers: Imagine SQL defining cache invalidation or transformation logic.
- Per-tenant customization: Using SQL to define tenant-specific routing or data shaping.
It’s not just about this specific webhook gateway. It’s about advocating for a way of thinking: that databases, especially modern ones like DuckDB, can be potent runtime logic engines, not just passive storage layers. And that sometimes, "weird" architectures unlock remarkably useful properties once you reframe the problem you're trying to solve.
I'm curious to hear what others think. What are the pitfalls I haven't considered? What other use cases could this "database as runtime" pattern unlock? Let me know!
not made by a 🤖