Data infrastructure reference

The Zero-Dependency Data Architecture Blueprint

A practitioner's reference for building analytics infrastructure without managed platforms, proprietary formats, or vendor dependency.

AuthorNauman Shahid
RolePrincipal Data Engineer
TypeReference document

The vulnerability becomes visible at the third-year renewal. The quote is forty percent higher. An engineering lead three years into a Snowflake contract calculates the migration cost, realises it exceeds the price increase, and signs the renewal. That is the exact moment an architecture transitions from a technical choice into a vendor capture mechanism.

This document describes the architecture that prevents that moment from arriving. Not because vendor tools are without merit, but because the decision to use them should be deliberate, and the exit path should always be known before the contract is signed. Most organisations discover the exit path does not exist at the moment they most need it.

The Mechanics of Vendor Capture

Managed platforms engineer their pricing to reward initial adoption and penalise scale. An engineering team building on Snowflake, BigQuery, or Databricks pays a premium for an ecosystem built to make egress painful. The pipeline relies on proprietary SQL dialects. The storage sits in formats readable only through the vendor's compute layer. The budget is tied to an arbitrary credit system with no ceiling.

The failure modes are consistent across organisations. The success penalty: as data volume grows, the bill scales exponentially. The zombie cost: the organisation pays for compute availability even when nothing is running. Black box tuning: the underlying hardware is inaccessible, leaving credit inflation as the only available lever when performance degrades.

None of this is accidental. The vendor's objective is to ensure that the cost of leaving always exceeds the cost of staying. They achieve this through proprietary storage formats, egress fees, and the slow accumulation of pipeline logic written against their specific SQL dialect. Three years in, an engineering team cannot move their data without moving the business.

Decoupling the components from the beginning eliminates that leverage. Running compute on commodity hardware and storing data in open, universally readable formats does not require sacrifice in performance or capability. DuckDB, Parquet, and Python serve ninety percent of analytical workloads with no vendor dependency whatsoever.

The core thesis: DuckDB provides columnar, in-process analytics at speeds that compete directly with managed warehouses. Parquet provides compressed, open-standard storage. Python binds them. A complete data warehouse can execute on a forty-dollar virtual machine and scale to process terabytes on a dedicated server at a fraction of the managed service cost. Moving providers requires copying files. Decommissioning requires stopping a process. The vendor has no leverage.

The Five-Layer Stack

A zero-dependency architecture separates five distinct functional layers. Each layer uses open-source tooling, stores data in open formats, and can be replaced independently without disrupting the others. That replaceability is the point: it is what a vendor-dependent stack deliberately prevents.

Layer 1: Ingestion Pull from sources, write to open storage. No per-row pricing, no managed connectors.

Layer 2: Storage Open format. Readable by any engine. Zero egress fees.

Layer 3: Transformation Query Parquet directly. No data loading, no warehouse compute fees.

Layer 4: Orchestration Most teams do not need a distributed DAG runner.

Layer 5: Serving Version-controlled reports or a self-hosted BI tool. Both beat Looker's licence.

Example directory structure

/opt/data-stack/
├── ingestion/
│   ├── scripts/
│   │   ├── extract_stripe.py
│   │   └── extract_postgres.py
│   └── requirements.txt
├── dbt_project/
│   ├── models/
│   │   ├── staging/
│   │   └── marts/
│   └── dbt_project.yml
├── data/
│   ├── raw/           (Parquet files)
│   └── duckdb/
│       └── warehouse.db
└── orchestration/
    └── run_daily_pipeline.sh

Sample ingestion script (extract_stripe.py)

import os
import requests
import pandas as pd
from datetime import datetime

STRIPE_KEY = os.getenv("STRIPE_KEY")
TARGET_DIR = "/opt/data-stack/data/raw/stripe/charges/"

def extract_charges():
    headers = {"Authorization": f"Bearer {STRIPE_KEY}"}
    url = "https://api.stripe.com/v1/charges?limit=100"

    response = requests.get(url, headers=headers)
    data = response.json()["data"]

    if not data:
        return

    df = pd.DataFrame(data)

    today = datetime.now().strftime("%Y-%m-%d")
    os.makedirs(f"{TARGET_DIR}/date={today}", exist_ok=True)
    file_path = f"{TARGET_DIR}/date={today}/charges.parquet"

    df.to_parquet(file_path, engine="pyarrow", index=False)
    print(f"Extracted {len(df)} records to {file_path}")

if __name__ == "__main__":
    extract_charges()

Building the Stack

What follows is the implementation sequence on a single Ubuntu 22.04 LTS virtual machine. This is not a tutorial. It is the exact sequence used in a real build. Adapt the VM tier to the data volume; every other step is constant.

VM sizing by data volume

TierData volumeSpecificationMonthly cost (approx.)
StartupUnder 50 GB2 vCPUs, 8 GB RAM, 100 GB NVMe$20–$40
Scale-up50 GB – 500 GB8 vCPUs, 32 GB RAM, 1 TB NVMe$100–$150
Enterprise500 GB and above16 vCPUs, 128 GB RAM, 4 TB NVMe$400–$600

Step 1: Environment setup

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv wget curl git unzip

sudo mkdir -p /opt/data-stack
sudo chown -R $USER:$USER /opt/data-stack
cd /opt/data-stack

python3 -m venv venv
source venv/bin/activate
pip install requests pandas pyarrow dbt-duckdb

Step 2: DuckDB CLI

wget https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip -d /usr/local/bin
rm duckdb_cli-linux-amd64.zip

Step 3: Initial warehouse configuration

DuckDB is a file, not a service. There is no daemon to start. Interaction is via the CLI or Python.

duckdb /opt/data-stack/data/duckdb/warehouse.db
-- Create a view over raw Parquet files. No loading required.
CREATE SCHEMA raw;
CREATE VIEW raw.stripe_charges AS
SELECT * FROM read_parquet('/opt/data-stack/data/raw/stripe/charges/*/*.parquet');
.quit

Step 4: dbt configuration

mkdir /opt/data-stack/dbt_project
cd /opt/data-stack/dbt_project
dbt init my_warehouse

Configure ~/.dbt/profiles.yml:

my_warehouse:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /opt/data-stack/data/duckdb/warehouse.db
      threads: 4

Example dbt model (models/marts/fct_charges.sql):

{{ config(materialized='table') }}

SELECT
    id AS charge_id,
    amount / 100.0 AS amount_usd,
    status,
    created AS created_at
FROM {{ source('raw', 'stripe_charges') }}
WHERE status = 'succeeded'

Step 5: Serving layer

sudo apt install -y docker.io docker-compose

docker run -d -p 3000:3000 \
  -v /opt/data-stack/data/duckdb:/metabase-data \
  --name metabase metabase/metabase:latest

Install the community DuckDB driver and mount it into the Metabase plugins directory. Configuration documentation: metabase.com/docs.

Step 6: End-to-end pipeline script

Save as /opt/data-stack/orchestration/run_daily_pipeline.sh:

#!/bin/bash
set -e
source /opt/data-stack/venv/bin/activate

echo "1. Ingestion..."
python /opt/data-stack/ingestion/scripts/extract_stripe.py

echo "2. Transformations..."
cd /opt/data-stack/dbt_project
dbt run

echo "Pipeline complete."

Add to crontab (crontab -e):

0 2 * * * /opt/data-stack/orchestration/run_daily_pipeline.sh >> /var/log/data_pipeline.log 2>&1

Migration from a Managed Warehouse

The question is not whether to migrate. It is whether to migrate now, when the cost is known and the leverage is yours, or at the third-year renewal, when neither is true. The sequence below runs the migration in parallel with the existing stack. There is no cutover risk because nothing is removed until the replacement has been verified.

Phase 1: Shadow ingestion

Leave the current stack fully operational. Set up the VM and begin writing Python ingestion scripts that write Parquet files alongside the existing managed syncs. The goal is not to replace anything yet. It is to establish data parity in the raw layer and confirm that every source can be extracted without the managed connector.

Phase 2: SQL translation

Port the dbt models to DuckDB-compatible SQL. DuckDB's dialect is close to PostgreSQL: the translation from Snowflake requires attention to a handful of specific functions (ARRAY_AGG behaviour, some window function extensions, a few date arithmetic differences). Run dbt against both systems simultaneously until outputs match.

Phase 3: BI cutover

Connect a parallel Metabase or Evidence instance to the DuckDB database. Rebuild the five most-used dashboards. At this stage most teams discover that DuckDB returns query results faster than the managed warehouse they were paying for. That observation is useful in the decommission conversation.

Phase 4: Decommission

Disable the managed ingestion connectors. Terminate the Snowflake or BigQuery instance. Cancel the dbt Cloud subscription. The monthly infrastructure invoice drops to the VM cost. The leverage reversal is permanent: from this point, the architecture is yours.

Cost Comparison by Scale

The numbers below are based on published pricing as of mid-2025. The managed stack figures use conservative estimates: actual costs at scale are typically higher due to usage-based billing volatility.

Tier 1: Startup (under 50 GB, daily syncs)

ComponentVendor stackZero-dependency stack
IngestionFivetran: $300/moPython scripts: $0
Storage and computeSnowflake: $200/moVM + disk: $40/mo
Transformationdbt Cloud: $100/modbt Core: $0
ServingMetabase Cloud: $85/moMetabase CE: $0
Total$685/mo$40/mo
Annual saving: $7,740

At startup scale, the zero-dependency stack costs less than a single Fivetran connector.

Tier 2: Scale-up (50–500 GB, hourly syncs)

ComponentVendor stackZero-dependency stack
IngestionFivetran: $1,200/moPython scripts: $0
Storage and computeSnowflake: $1,500/moHigh-memory VM: $150/mo
Transformationdbt Cloud: $200/modbt Core: $0
ServingLooker: $1,500/moMetabase CE: $0
Total$4,400/mo$150/mo
Annual saving: $50,820

The serving layer alone (Looker) costs more each month than the entire zero-dependency stack.

Tier 3: Enterprise (500 GB and above, near-real-time)

ComponentVendor stackZero-dependency stack
IngestionFivetran: $3,000/moPython scripts: $0
Storage and computeSnowflake: $5,000+/moBare metal server: $450/mo
Transformationdbt Cloud: $500/modbt Core: $0
ServingLooker: $3,000/moMetabase CE: $0
Total$11,500+/mo$450/mo
Annual saving: $131,400+

At enterprise scale, the managed stack costs more per month than the zero-dependency stack costs in a year.

Maintenance

Running self-hosted infrastructure requires an operations posture. The posture is minimal: keep the data backed up, monitor the pipeline, upgrade on a schedule. Nothing here requires a dedicated operations team.

Backup

Raw data already exists in Parquet on disk or in object storage: those files are the backup. The DuckDB warehouse file and Metabase application database need a separate backup job.

# Push the full data stack directory to a cold storage bucket hourly
aws s3 sync /opt/data-stack/ s3://my-cold-backup-bucket/data-stack/ --delete

Monitoring

Do not install Datadog. It contradicts the architecture's purpose. Use a shell trap to send a Slack webhook if the daily pipeline script fails. Install Netdata (one-line installation) for a local CPU, RAM, and disk I/O dashboard that sends nothing to a third party.

Upgrade cadence

When the stack strains under load

If the VM runs out of memory during dbt transformations: vertical scaling takes two minutes (shut down, resize in the provider console, restart). If query performance degrades on large datasets: ensure dbt is running incremental models, not full refreshes, and move raw Parquet files from local disk to object storage for DuckDB to query remotely via HTTPFS extension. These are the two adjustments that resolve ninety percent of performance problems at scale.

Nauman Shahid builds zero-dependency data infrastructure for organisations in the UAE and the Gulf region. If your current data infrastructure cost or vendor exposure is a concern, diagnostic audit engagements are available through www.mindflex.tech.

These documents come from live diagnostic work. If your data infrastructure, vendor exposure, or compliance posture needs attention:

Discuss a diagnostic engagement →