Data infrastructure reference

The Zero-Dependency Data Architecture Blueprint

A practitioner's reference for building analytics infrastructure without managed platforms, proprietary formats, or vendor dependency.

Author: Nauman Shahid

Role: Principal Data Engineer

Type: Reference document

The vulnerability becomes visible at the third-year renewal. The quote is forty percent higher. An engineering lead three years into a Snowflake contract calculates the migration cost, realises it exceeds the price increase, and signs the renewal. That is the exact moment an architecture transitions from a technical choice into a vendor capture mechanism.

This document describes the architecture that prevents that moment from arriving. Not because vendor tools are without merit, but because the decision to use them should be deliberate, and the exit path should always be known before the contract is signed. Most organisations discover the exit path does not exist at the moment they most need it.

The Mechanics of Vendor Capture

Managed platforms engineer their pricing to reward initial adoption and penalise scale. An engineering team building on Snowflake, BigQuery, or Databricks pays a premium for an ecosystem built to make egress painful. The pipeline relies on proprietary SQL dialects. The storage sits in formats readable only through the vendor's compute layer. The budget is tied to an arbitrary credit system with no ceiling.

The failure modes are consistent across organisations. The success penalty: as data volume and query load grow, the usage-based bill grows with them, often faster than the value the data returns. The zombie cost: the organisation pays for compute availability even when nothing is running. Black box tuning: the underlying hardware is inaccessible, so when performance degrades the only available lever is a larger warehouse, which means more credits.

None of this is accidental. The vendor's objective is to ensure that the cost of leaving always exceeds the cost of staying. They achieve this through proprietary storage formats, egress fees, and the slow accumulation of pipeline logic written against their specific SQL dialect. Three years in, an engineering team cannot move their data without moving the business.

Decoupling the components from the beginning eliminates that leverage. Running compute on commodity hardware and storing data in open, universally readable formats does not require sacrifice in performance or capability. DuckDB, Parquet, and Python serve ninety percent of analytical workloads with no vendor dependency whatsoever.

The core thesis: DuckDB provides columnar, in-process analytics at speeds that compete directly with managed warehouses. Parquet provides compressed, open-standard storage. Python binds them. A complete data warehouse can execute on a forty-dollar virtual machine and scale to process terabytes on a dedicated server at a fraction of the managed service cost. Moving providers requires copying files. Decommissioning requires stopping a process. The vendor has no leverage.

The Five-Layer Stack

A zero-dependency architecture separates five distinct functional layers. Each layer uses open-source tooling, stores data in open formats, and can be replaced independently without disrupting the others. That replaceability is the point: it is what a vendor-dependent stack deliberately prevents.

Layer 1: Ingestion

Pull from sources, write to open storage. No per-row pricing, no managed connectors.

Tools: Python, requests/httpx, standard libraries
Replaces: Fivetran, Airbyte (managed), Stitch
Why: Managed ingestion tools charge per row or per megabyte transferred. At scale, this becomes the most expensive line item on the data infrastructure invoice. Python scripts are version-controlled, fully customisable, and cost nothing beyond compute time. Any API, any database, any custom source: a Python script handles it without a connector marketplace.
Output pattern: Scripts extract from source, write directly to local disk or object storage as Parquet files partitioned by date.

Layer 2: Storage

Open format. Readable by any engine. Zero egress fees.

Tools: Local disk, AWS S3, or Cloudflare R2 with Apache Parquet
Replaces: Snowflake internal storage, Redshift managed storage, BigQuery native storage
Why: Parquet is compressed, columnar, and readable without conversion by virtually every modern analytical engine. Cloudflare R2 charges no egress fees. The data sits in a format you own entirely: if R2 raises prices tomorrow, you copy the files to S3 and update one config value.
Partition pattern: s3://data-lake/source_system/entity/year=2024/month=10/
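The partition pattern above can be generated rather than typed by hand. The following is a minimal sketch, assuming Hive-style key=value directories and a zero-padded month; the function name partition_path and the example source and entity names are illustrative and not part of the original build.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, source: str, entity: str, day: date) -> str:
    """Build a Hive-style partition directory: root/source/entity/year=YYYY/month=MM/."""
    relative = (
        PurePosixPath(source)
        / entity
        / f"year={day.year}"
        / f"month={day.month:02d}"  # zero padding is an assumption, chosen so paths sort
    )
    # root is kept as a plain string so the s3:// scheme is not normalised away.
    return f"{root.rstrip('/')}/{relative}/"


print(partition_path("s3://data-lake", "stripe", "charges", date(2024, 10, 3)))
```

Keeping the layout in one function means a change of bucket or provider touches a single argument, which is the portability the storage layer is meant to provide.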

Layer 3: Transformation

Query Parquet directly. No data loading, no warehouse compute fees.

Tools: DuckDB, dbt Core
Replaces: Snowflake compute, BigQuery, Databricks, dbt Cloud
Why: DuckDB queries Parquet files in place, either on local disk or in object storage over HTTP through its httpfs extension. There is no preliminary loading step, no warehouse to provision, no compute cluster to keep warm. dbt Core enforces software engineering discipline on SQL transformations: tests, documentation, lineage, all version-controlled in git. Together they produce a full warehouse capability at infrastructure cost only.

Layer 4: Orchestration

Most teams do not need a distributed DAG runner.

Tools: cron, systemd timers, or GitHub Actions
Replaces: Airflow (managed), Prefect Cloud, Dagster Cloud
Why: A shell script on a timer handles ninety-nine percent of what a managed orchestration platform handles, with none of the operational overhead and none of the monthly fee. Add Airflow only when the pipeline complexity genuinely requires it, not because it appeared in an architecture diagram.
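For teams that prefer to keep the runner in Python rather than shell, the same fail-fast behaviour can be sketched in a few lines. This is an illustration only: the function name run_steps and the placeholder commands are assumptions, and a real pipeline would substitute its own ingestion and dbt commands.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order.

    Stops at the first non-zero exit code and returns that step's name,
    or returns None when every step succeeded.
    """
    for name, argv in steps:
        print(f"Running {name}...")
        result = subprocess.run(argv)
        if result.returncode != 0:
            return name
    return None


# Placeholder commands; each simply prints a marker and exits successfully.
steps = [
    ("ingestion", [sys.executable, "-c", "print('extract')"]),
    ("transformations", [sys.executable, "-c", "print('transform')"]),
]
failed_step = run_steps(steps)
print("Pipeline complete." if failed_step is None else f"Failed at {failed_step}")
```

Scheduled from cron or a systemd timer, a runner like this covers sequential pipelines; dependency graphs, retries, and backfills are the point at which a dedicated orchestrator starts to earn its overhead.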

Layer 5: Serving

Version-controlled reports or a self-hosted BI tool. Both beat Looker's licence.

Tools: Evidence.dev (code-based), Metabase CE (self-hosted)
Replaces: Looker, Tableau, Mode, Hex
Why: Evidence.dev compiles SQL into static HTML dashboards, version-controlled in git, deployed as a static site. No server, no licence, no vendor. Metabase Community Edition is a full-featured BI tool that connects to DuckDB through a community-maintained driver and runs on the same VM as the warehouse.

Example directory structure

/opt/data-stack/
├── ingestion/
│   ├── scripts/
│   │   ├── extract_stripe.py
│   │   └── extract_postgres.py
│   └── requirements.txt
├── dbt_project/
│   ├── models/
│   │   ├── staging/
│   │   └── marts/
│   └── dbt_project.yml
├── data/
│   ├── raw/           (Parquet files)
│   └── duckdb/
│       └── warehouse.db
└── orchestration/
    └── run_daily_pipeline.sh

Sample ingestion script (`extract_stripe.py`)

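import os
from datetime import datetime, timezone

import pandas as pd
import requests

# Fail immediately with a KeyError if the secret is not set.
STRIPE_KEY = os.environ["STRIPE_KEY"]
TARGET_DIR = "/opt/data-stack/data/raw/stripe/charges"

def extract_charges():
    headers = {"Authorization": f"Bearer {STRIPE_KEY}"}
    url = "https://api.stripe.com/v1/charges?limit=100"

    # This fetches only the first page (up to 100 charges).
    # Follow the has_more flag in the response to retrieve the rest.
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # surface auth and rate-limit errors explicitly
    data = response.json()["data"]

    if not data:
        return

    # Flatten nested objects (billing_details, outcome, ...) into dotted column
    # names; list-valued fields may still need explicit handling before writing.
    df = pd.json_normalize(data)

    # Partition by UTC date so the folder does not depend on the server timezone.
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    partition_dir = f"{TARGET_DIR}/date={today}"
    os.makedirs(partition_dir, exist_ok=True)
    file_path = f"{partition_dir}/charges.parquet"

    df.to_parquet(file_path, engine="pyarrow", index=False)
    print(f"Extracted {len(df)} records to {file_path}")

if __name__ == "__main__":
    extract_charges()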
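The script above requests a single page of results. Stripe list endpoints return a has_more flag and accept a starting_after cursor, so a full extract needs a loop. The sketch below shows that loop in isolation; the paginate helper and the injected fetch_page callable are illustrative names, not part of the original script, and the HTTP call is left to the caller so the logic can be tested without network access.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every record from a cursor-paginated list endpoint.

    fetch_page receives the id of the last record seen (None for the first
    call) and returns a dict with a "data" list and a "has_more" flag.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        records = page["data"]
        yield from records
        # Stop when the API reports no further pages, or on an empty page.
        if not page.get("has_more") or not records:
            break
        cursor = records[-1]["id"]


# Stand-in for the API: two pages keyed by the cursor that requests them.
fake_pages = {
    None: {"data": [{"id": "ch_1"}, {"id": "ch_2"}], "has_more": True},
    "ch_2": {"data": [{"id": "ch_3"}], "has_more": False},
}
all_ids = [record["id"] for record in paginate(lambda cursor: fake_pages[cursor])]
print(all_ids)
```

In the real script, fetch_page would wrap requests.get and pass the cursor as the starting_after query parameter whenever it is not None.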

Building the Stack

What follows is the implementation sequence on a single Ubuntu 22.04 LTS virtual machine. This is not a tutorial. It is the exact sequence used in a real build. Adapt the VM tier to the data volume; every other step is constant.

VM sizing by data volume

Tier	Data volume	Specification	Monthly cost (approx.)
Startup	Under 50 GB	2 vCPUs, 8 GB RAM, 100 GB NVMe	$20–$40
Scale-up	50 GB – 500 GB	8 vCPUs, 32 GB RAM, 1 TB NVMe	$100–$150
Enterprise	500 GB and above	16 vCPUs, 128 GB RAM, 4 TB NVMe	$400–$600
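The sizing table can be encoded as a lookup so that provisioning scripts pick a tier from a measured data volume. This is a small sketch; the function name pick_tier is hypothetical, and assigning exactly 50 GB to Scale-up and exactly 500 GB to Enterprise is an assumption drawn from the table's wording ("under 50 GB", "500 GB and above").

```python
# (exclusive upper bound in GB, tier name, specification) taken from the table above.
TIERS = [
    (50, "Startup", "2 vCPUs, 8 GB RAM, 100 GB NVMe"),
    (500, "Scale-up", "8 vCPUs, 32 GB RAM, 1 TB NVMe"),
    (float("inf"), "Enterprise", "16 vCPUs, 128 GB RAM, 4 TB NVMe"),
]


def pick_tier(data_gb: float) -> tuple[str, str]:
    """Return (tier name, specification) for a data volume in gigabytes."""
    if data_gb < 0:
        raise ValueError("data volume cannot be negative")
    for upper_bound, name, spec in TIERS:
        if data_gb < upper_bound:
            return name, spec
    raise AssertionError("unreachable: the last tier has no upper bound")


for volume in (12, 50, 750):
    print(volume, "GB ->", pick_tier(volume))
```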

Step 1: Environment setup

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv wget curl git unzip

sudo mkdir -p /opt/data-stack
sudo chown -R $USER:$USER /opt/data-stack
cd /opt/data-stack

# Create the directories that later steps write into.
mkdir -p ingestion/scripts orchestration data/raw data/duckdb

python3 -m venv venv
source venv/bin/activate
pip install requests pandas pyarrow dbt-duckdb

Step 2: DuckDB CLI

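# Keep the CLI release in step with the duckdb Python package that dbt-duckdb installs,
# so both read and write the same database file format.
wget https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip
# /usr/local/bin is owned by root, so unzip needs sudo.
sudo unzip duckdb_cli-linux-amd64.zip -d /usr/local/bin
rm duckdb_cli-linux-amd64.zip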

Step 3: Initial warehouse configuration

DuckDB is a file, not a service. There is no daemon to start. Interaction is via the CLI or Python.

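mkdir -p /opt/data-stack/data/duckdb
duckdb /opt/data-stack/data/duckdb/warehouse.db

-- Create a view over raw Parquet files. No loading required.
-- Run the ingestion script first: the glob must match at least one file.
CREATE SCHEMA IF NOT EXISTS raw;
CREATE OR REPLACE VIEW raw.stripe_charges AS
SELECT * FROM read_parquet('/opt/data-stack/data/raw/stripe/charges/*/*.parquet');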
.quit
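The same view can be created from Python, which is convenient when several sources need identical treatment. The sketch below builds the statements and runs them against any connection object that exposes an execute method; the helper names are illustrative, identifiers are interpolated without validation, and the duckdb usage appears only as a comment so the block has no third-party imports.

```python
def parquet_view_statements(schema: str, view: str, glob: str) -> list[str]:
    """Return the SQL that exposes a Parquet glob as a view. No data is copied."""
    escaped_glob = glob.replace("'", "''")  # escape quotes inside the SQL string literal
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"CREATE OR REPLACE VIEW {schema}.{view} AS "
        f"SELECT * FROM read_parquet('{escaped_glob}')",
    ]


def create_parquet_view(con, schema: str, view: str, glob: str) -> None:
    """Execute the statements on a connection (schema and view names must be trusted)."""
    for statement in parquet_view_statements(schema, view, glob):
        con.execute(statement)


# Usage with the duckdb package (installed as a dependency of dbt-duckdb):
#   import duckdb
#   con = duckdb.connect("/opt/data-stack/data/duckdb/warehouse.db")
#   create_parquet_view(con, "raw", "stripe_charges",
#                       "/opt/data-stack/data/raw/stripe/charges/*/*.parquet")

for statement in parquet_view_statements(
    "raw", "stripe_charges", "/opt/data-stack/data/raw/stripe/charges/*/*.parquet"
):
    print(statement)
```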

Step 4: dbt configuration

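cd /opt/data-stack
dbt init my_warehouse
# dbt init creates ./my_warehouse; rename it so dbt_project.yml sits directly
# under /opt/data-stack/dbt_project, where the pipeline script runs dbt.
mv my_warehouse dbt_project
cd dbt_project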

Configure ~/.dbt/profiles.yml:

my_warehouse:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /opt/data-stack/data/duckdb/warehouse.db
      threads: 4

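Example dbt model (models/marts/fct_charges.sql). The source() call resolves only if a source named raw with a table stripe_charges is declared in a sources YAML file under models/:

{{ config(materialized='table') }}

SELECT
    id AS charge_id,
    -- Stripe reports amounts in the smallest currency unit; this assumes all charges are in USD.
    amount / 100.0 AS amount_usd,
    status,
    -- created is a Unix epoch in seconds.
    to_timestamp(created) AS created_at
FROM {{ source('raw', 'stripe_charges') }}
WHERE status = 'succeeded'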

Step 5: Serving layer

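sudo apt install -y docker.io

# Three mounts: the warehouse file, the driver plugins, and Metabase's own
# application database (persisted through MB_DB_FILE so it survives restarts).
sudo docker run -d -p 3000:3000 \
  -v /opt/data-stack/data/duckdb:/duckdb-data \
  -v /opt/data-stack/metabase/plugins:/plugins \
  -v /opt/data-stack/metabase/appdb:/metabase-data \
  -e MB_DB_FILE=/metabase-data/metabase.db \
  --name metabase metabase/metabase:latest

Install the community DuckDB driver by placing its JAR in /opt/data-stack/metabase/plugins, which the command above mounts as the Metabase plugins directory. Inside the container the warehouse file is then available at /duckdb-data/warehouse.db. DuckDB allows either one read-write process or several read-only processes on a database file, so a Metabase connection that holds the file open can conflict with the nightly dbt run; test that interaction before relying on it. Configuration documentation: metabase.com/docs.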

Step 6: End-to-end pipeline script

Save as /opt/data-stack/orchestration/run_daily_pipeline.sh and make it executable with chmod +x:

#!/bin/bash
set -e
source /opt/data-stack/venv/bin/activate

echo "1. Ingestion..."
python /opt/data-stack/ingestion/scripts/extract_stripe.py

echo "2. Transformations..."
cd /opt/data-stack/dbt_project
dbt run

echo "Pipeline complete."

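Add to crontab (crontab -e). The log file must be writable by the crontab's owner, which /var/log usually is not for a non-root user, so the log is kept inside the stack directory:

0 2 * * * /opt/data-stack/orchestration/run_daily_pipeline.sh >> /opt/data-stack/pipeline.log 2>&1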

Migration from a Managed Warehouse

The question is not whether to migrate. It is whether to migrate now, when the cost is known and the leverage is yours, or at the third-year renewal, when neither is true. The sequence below runs the migration in parallel with the existing stack. There is no cutover risk because nothing is removed until the replacement has been verified.

Phase 1: Shadow ingestion

Leave the current stack fully operational. Set up the VM and begin writing Python ingestion scripts that write Parquet files alongside the existing managed syncs. The goal is not to replace anything yet. It is to establish data parity in the raw layer and confirm that every source can be extracted without the managed connector.

Phase 2: SQL translation

Port the dbt models to DuckDB-compatible SQL. DuckDB's dialect is close to PostgreSQL: the translation from Snowflake requires attention to a handful of specific functions (ARRAY_AGG behaviour, some window function extensions, a few date arithmetic differences). Run dbt against both systems simultaneously until outputs match.
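"Until outputs match" is easier to enforce with an automated comparison than by eye. The sketch below compares two result sets fetched as lists of row tuples, ignoring row order and rounding floats to absorb engine-level differences. The function names and the six-decimal tolerance are assumptions made for illustration; a real check would also need to align column order and data types between the two systems.

```python
from collections import Counter


def normalise(row: tuple, places: int = 6) -> tuple:
    """Round float values so small numeric differences between engines compare equal."""
    return tuple(round(v, places) if isinstance(v, float) else v for v in row)


def diff_results(left: list, right: list, places: int = 6):
    """Return (rows only in left, rows only in right), ignoring row order.

    Counter is used instead of set so that duplicate rows are also compared.
    """
    left_counts = Counter(normalise(r, places) for r in left)
    right_counts = Counter(normalise(r, places) for r in right)
    only_left = list((left_counts - right_counts).elements())
    only_right = list((right_counts - left_counts).elements())
    return only_left, only_right


# Illustrative rows standing in for the same model queried on both systems.
managed_rows = [("ch_1", 10.0, "succeeded"), ("ch_2", 5.5, "succeeded")]
duckdb_rows = [("ch_2", 5.5000000001, "succeeded"), ("ch_1", 10.0, "succeeded")]
print(diff_results(managed_rows, duckdb_rows))
```

Two empty lists mean the model is at parity; anything else names the exact rows to investigate.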

Phase 3: BI cutover

Connect a parallel Metabase or Evidence instance to the DuckDB database. Rebuild the five most-used dashboards. At this stage many teams find that, for data that fits on a single machine, DuckDB returns query results faster than the managed warehouse they were paying for. That observation is useful in the decommission conversation.

Phase 4: Decommission

Disable the managed ingestion connectors. Terminate the Snowflake or BigQuery instance. Cancel the dbt Cloud subscription. The monthly infrastructure invoice drops to the VM cost. The leverage reversal is permanent: from this point, the architecture is yours.

Cost Comparison by Scale

The numbers below are based on published pricing as of mid-2025. The managed stack figures use conservative estimates: actual costs at scale are typically higher due to usage-based billing volatility.

Tier 1: Startup (under 50 GB, daily syncs)

Component	Vendor stack	Zero-dependency stack
Ingestion	Fivetran: $300/mo	Python scripts: $0
Storage and compute	Snowflake: $200/mo	VM + disk: $40/mo
Transformation	dbt Cloud: $100/mo	dbt Core: $0
Serving	Metabase Cloud: $85/mo	Metabase CE: $0
Total	$685/mo	$40/mo

Annual saving: $7,740

At startup scale, the zero-dependency stack costs less than a single Fivetran connector.

Tier 2: Scale-up (50–500 GB, hourly syncs)

Component	Vendor stack	Zero-dependency stack
Ingestion	Fivetran: $1,200/mo	Python scripts: $0
Storage and compute	Snowflake: $1,500/mo	High-memory VM: $150/mo
Transformation	dbt Cloud: $200/mo	dbt Core: $0
Serving	Looker: $1,500/mo	Metabase CE: $0
Total	$4,400/mo	$150/mo

Annual saving: $51,000

The serving layer alone (Looker) costs more each month than the entire zero-dependency stack.

Tier 3: Enterprise (500 GB and above, near-real-time)

Component	Vendor stack	Zero-dependency stack
Ingestion	Fivetran: $3,000/mo	Python scripts: $0
Storage and compute	Snowflake: $5,000+/mo	Bare metal server: $450/mo
Transformation	dbt Cloud: $500/mo	dbt Core: $0
Serving	Looker: $3,000/mo	Metabase CE: $0
Total	$11,500+/mo	$450/mo

Annual saving: $132,600+

At enterprise scale, the managed stack costs more per month than the zero-dependency stack costs in a year.
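The totals and annual savings follow directly from the table rows: sum the vendor components, subtract the zero-dependency monthly cost, and multiply by twelve. The short sketch below reproduces that arithmetic from the figures in the three tables so the numbers can be rechecked when prices change; the variable and function names are illustrative.

```python
# Monthly figures in USD, copied from the three tier tables:
# vendor components are (ingestion, storage and compute, transformation, serving).
TIER_COSTS = {
    "Startup": ((300, 200, 100, 85), 40),
    "Scale-up": ((1200, 1500, 200, 1500), 150),
    "Enterprise": ((3000, 5000, 500, 3000), 450),
}


def annual_saving(vendor_components, self_hosted_monthly):
    """Twelve times the monthly difference between the two stacks."""
    return (sum(vendor_components) - self_hosted_monthly) * 12


for tier, (vendor_components, self_hosted_monthly) in TIER_COSTS.items():
    vendor_total = sum(vendor_components)
    saving = annual_saving(vendor_components, self_hosted_monthly)
    print(f"{tier}: vendor ${vendor_total:,}/mo, saving ${saving:,}/yr")
```

The Enterprise figure is a lower bound, because the Snowflake line in that table is itself quoted as a minimum.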

Maintenance

Running self-hosted infrastructure requires an operations posture. The posture is minimal: keep the data backed up, monitor the pipeline, upgrade on a schedule. Nothing here requires a dedicated operations team.

Backup

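Raw data held in object storage is already durable and needs no further copy. Raw Parquet on the VM's local disk, the DuckDB warehouse file, and the Metabase application database all need a backup job that copies them off the machine.

# Push the data stack directory to a cold storage bucket (requires the AWS CLI).
# Schedule it from cron, for example hourly. venv/ is excluded because it can be rebuilt.
# --delete is omitted so a file removed locally by mistake is not also removed from the backup.
aws s3 sync /opt/data-stack/ s3://my-cold-backup-bucket/data-stack/ --exclude "venv/*"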

Monitoring

Do not install Datadog. It contradicts the architecture's purpose. Use a shell trap to send a Slack webhook if the daily pipeline script fails. Install Netdata (one-line installation) for a local CPU, RAM, and disk I/O dashboard that sends nothing to a third party.
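The text describes a shell trap; where the pipeline runner is written in Python, an equivalent alert takes only the standard library. This is a sketch rather than the original build's method: the function names are hypothetical, the webhook URL is a placeholder, and it relies on Slack incoming webhooks accepting a JSON body with a text field.

```python
import json
import urllib.request


def build_alert(webhook_url: str, message: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook without sending it."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url: str, message: str, timeout: float = 10.0) -> int:
    """Send the alert and return the HTTP status code."""
    with urllib.request.urlopen(build_alert(webhook_url, message), timeout=timeout) as response:
        return response.status


# Placeholder URL; the real one comes from the Slack app configuration.
request = build_alert(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Daily pipeline failed on data-stack VM",
)
print(request.get_method(), request.full_url)
```

Calling send_alert from an except block around the pipeline steps gives the same effect as the trap: silence on success, one message on failure.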

Upgrade cadence

DuckDB: Quarterly. Review the changelog for breaking SQL changes before upgrading.
dbt Core: Semi-annually.
Operating system: apt update && apt upgrade monthly during a low-usage window.

When the stack strains under load

If the VM runs out of memory during dbt transformations: vertical scaling takes a few minutes (shut down, resize in the provider console, restart). If query performance degrades on large datasets: ensure dbt is running incremental models, not full refreshes, and that queries filter on the partition columns so DuckDB can skip files it does not need. If local disk capacity becomes the constraint, move older raw Parquet partitions to object storage and let DuckDB query them remotely through the httpfs extension, accepting that remote reads are slower than local NVMe. These adjustments resolve most performance problems at scale.

Nauman Shahid builds zero-dependency data infrastructure for organisations in the UAE and the Gulf region. If your current data infrastructure cost or vendor exposure is a concern, diagnostic audit engagements are available through www.mindflex.tech.
