Data infrastructure reference
A practitioner's reference for building analytics infrastructure without managed platforms, proprietary formats, or vendor dependency.
The vulnerability becomes visible at the third-year renewal. The quote is forty percent higher. An engineering lead three years into a Snowflake contract calculates the migration cost, realises it exceeds the price increase, and signs the renewal. That is the exact moment an architecture transitions from a technical choice into a vendor capture mechanism.
This document describes the architecture that prevents that moment from arriving. Not because vendor tools are without merit, but because the decision to use them should be deliberate, and the exit path should always be known before the contract is signed. Most organisations discover the exit path does not exist at the moment they most need it.
Managed platforms engineer their pricing to reward initial adoption and penalise scale. An engineering team building on Snowflake, BigQuery, or Databricks pays a premium for an ecosystem built to make egress painful. The pipeline relies on proprietary SQL dialects. The storage sits in formats readable only through the vendor's compute layer. The budget is tied to an arbitrary credit system with no ceiling.
The failure modes are consistent across organisations. The success penalty: as data volume grows, the bill scales exponentially. The zombie cost: the organisation pays for compute availability even when nothing is running. Black box tuning: the underlying hardware is inaccessible, leaving credit inflation as the only available lever when performance degrades.
None of this is accidental. The vendor's objective is to ensure that the cost of leaving always exceeds the cost of staying. They achieve this through proprietary storage formats, egress fees, and the slow accumulation of pipeline logic written against their specific SQL dialect. Three years in, an engineering team cannot move their data without moving the business.
Decoupling the components from the beginning eliminates that leverage. Running compute on commodity hardware and storing data in open, universally readable formats does not require sacrifice in performance or capability. DuckDB, Parquet, and Python serve ninety percent of analytical workloads with no vendor dependency whatsoever.
A zero-dependency architecture separates five distinct functional layers. Each layer uses open-source tooling, stores data in open formats, and can be replaced independently without disrupting the others. That replaceability is the point: it is what a vendor-dependent stack deliberately prevents.
s3://data-lake/source_system/entity/year=2024/month=10//opt/data-stack/
├── ingestion/
│ ├── scripts/
│ │ ├── extract_stripe.py
│ │ └── extract_postgres.py
│ └── requirements.txt
├── dbt_project/
│ ├── models/
│ │ ├── staging/
│ │ └── marts/
│ └── dbt_project.yml
├── data/
│ ├── raw/ (Parquet files)
│ └── duckdb/
│ └── warehouse.db
└── orchestration/
└── run_daily_pipeline.sh
extract_stripe.py)import os
import requests
import pandas as pd
from datetime import datetime
STRIPE_KEY = os.getenv("STRIPE_KEY")
TARGET_DIR = "/opt/data-stack/data/raw/stripe/charges/"
def extract_charges():
headers = {"Authorization": f"Bearer {STRIPE_KEY}"}
url = "https://api.stripe.com/v1/charges?limit=100"
response = requests.get(url, headers=headers)
data = response.json()["data"]
if not data:
return
df = pd.DataFrame(data)
today = datetime.now().strftime("%Y-%m-%d")
os.makedirs(f"{TARGET_DIR}/date={today}", exist_ok=True)
file_path = f"{TARGET_DIR}/date={today}/charges.parquet"
df.to_parquet(file_path, engine="pyarrow", index=False)
print(f"Extracted {len(df)} records to {file_path}")
if __name__ == "__main__":
extract_charges()
What follows is the implementation sequence on a single Ubuntu 22.04 LTS virtual machine. This is not a tutorial. It is the exact sequence used in a real build. Adapt the VM tier to the data volume; every other step is constant.
| Tier | Data volume | Specification | Monthly cost (approx.) |
|---|---|---|---|
| Startup | Under 50 GB | 2 vCPUs, 8 GB RAM, 100 GB NVMe | $20–$40 |
| Scale-up | 50 GB – 500 GB | 8 vCPUs, 32 GB RAM, 1 TB NVMe | $100–$150 |
| Enterprise | 500 GB and above | 16 vCPUs, 128 GB RAM, 4 TB NVMe | $400–$600 |
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv wget curl git unzip
sudo mkdir -p /opt/data-stack
sudo chown -R $USER:$USER /opt/data-stack
cd /opt/data-stack
python3 -m venv venv
source venv/bin/activate
pip install requests pandas pyarrow dbt-duckdb
wget https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip -d /usr/local/bin
rm duckdb_cli-linux-amd64.zip
DuckDB is a file, not a service. There is no daemon to start. Interaction is via the CLI or Python.
duckdb /opt/data-stack/data/duckdb/warehouse.db
-- Create a view over raw Parquet files. No loading required.
CREATE SCHEMA raw;
CREATE VIEW raw.stripe_charges AS
SELECT * FROM read_parquet('/opt/data-stack/data/raw/stripe/charges/*/*.parquet');
.quit
mkdir /opt/data-stack/dbt_project
cd /opt/data-stack/dbt_project
dbt init my_warehouse
Configure ~/.dbt/profiles.yml:
my_warehouse:
target: dev
outputs:
dev:
type: duckdb
path: /opt/data-stack/data/duckdb/warehouse.db
threads: 4
Example dbt model (models/marts/fct_charges.sql):
{{ config(materialized='table') }}
SELECT
id AS charge_id,
amount / 100.0 AS amount_usd,
status,
created AS created_at
FROM {{ source('raw', 'stripe_charges') }}
WHERE status = 'succeeded'
sudo apt install -y docker.io docker-compose
docker run -d -p 3000:3000 \
-v /opt/data-stack/data/duckdb:/metabase-data \
--name metabase metabase/metabase:latest
Install the community DuckDB driver and mount it into the Metabase plugins directory. Configuration documentation: metabase.com/docs.
Save as /opt/data-stack/orchestration/run_daily_pipeline.sh:
#!/bin/bash
set -e
source /opt/data-stack/venv/bin/activate
echo "1. Ingestion..."
python /opt/data-stack/ingestion/scripts/extract_stripe.py
echo "2. Transformations..."
cd /opt/data-stack/dbt_project
dbt run
echo "Pipeline complete."
Add to crontab (crontab -e):
0 2 * * * /opt/data-stack/orchestration/run_daily_pipeline.sh >> /var/log/data_pipeline.log 2>&1
The question is not whether to migrate. It is whether to migrate now, when the cost is known and the leverage is yours, or at the third-year renewal, when neither is true. The sequence below runs the migration in parallel with the existing stack. There is no cutover risk because nothing is removed until the replacement has been verified.
Leave the current stack fully operational. Set up the VM and begin writing Python ingestion scripts that write Parquet files alongside the existing managed syncs. The goal is not to replace anything yet. It is to establish data parity in the raw layer and confirm that every source can be extracted without the managed connector.
Port the dbt models to DuckDB-compatible SQL. DuckDB's dialect is close to PostgreSQL: the translation from Snowflake requires attention to a handful of specific functions (ARRAY_AGG behaviour, some window function extensions, a few date arithmetic differences). Run dbt against both systems simultaneously until outputs match.
Connect a parallel Metabase or Evidence instance to the DuckDB database. Rebuild the five most-used dashboards. At this stage most teams discover that DuckDB returns query results faster than the managed warehouse they were paying for. That observation is useful in the decommission conversation.
Disable the managed ingestion connectors. Terminate the Snowflake or BigQuery instance. Cancel the dbt Cloud subscription. The monthly infrastructure invoice drops to the VM cost. The leverage reversal is permanent: from this point, the architecture is yours.
The numbers below are based on published pricing as of mid-2025. The managed stack figures use conservative estimates: actual costs at scale are typically higher due to usage-based billing volatility.
| Component | Vendor stack | Zero-dependency stack |
|---|---|---|
| Ingestion | Fivetran: $300/mo | Python scripts: $0 |
| Storage and compute | Snowflake: $200/mo | VM + disk: $40/mo |
| Transformation | dbt Cloud: $100/mo | dbt Core: $0 |
| Serving | Metabase Cloud: $85/mo | Metabase CE: $0 |
| Total | $685/mo | $40/mo |
At startup scale, the zero-dependency stack costs less than a single Fivetran connector.
| Component | Vendor stack | Zero-dependency stack |
|---|---|---|
| Ingestion | Fivetran: $1,200/mo | Python scripts: $0 |
| Storage and compute | Snowflake: $1,500/mo | High-memory VM: $150/mo |
| Transformation | dbt Cloud: $200/mo | dbt Core: $0 |
| Serving | Looker: $1,500/mo | Metabase CE: $0 |
| Total | $4,400/mo | $150/mo |
The serving layer alone (Looker) costs more each month than the entire zero-dependency stack.
| Component | Vendor stack | Zero-dependency stack |
|---|---|---|
| Ingestion | Fivetran: $3,000/mo | Python scripts: $0 |
| Storage and compute | Snowflake: $5,000+/mo | Bare metal server: $450/mo |
| Transformation | dbt Cloud: $500/mo | dbt Core: $0 |
| Serving | Looker: $3,000/mo | Metabase CE: $0 |
| Total | $11,500+/mo | $450/mo |
At enterprise scale, the managed stack costs more per month than the zero-dependency stack costs in a year.
Running self-hosted infrastructure requires an operations posture. The posture is minimal: keep the data backed up, monitor the pipeline, upgrade on a schedule. Nothing here requires a dedicated operations team.
Raw data already exists in Parquet on disk or in object storage: those files are the backup. The DuckDB warehouse file and Metabase application database need a separate backup job.
# Push the full data stack directory to a cold storage bucket hourly
aws s3 sync /opt/data-stack/ s3://my-cold-backup-bucket/data-stack/ --delete
Do not install Datadog. It contradicts the architecture's purpose. Use a shell trap to send a Slack webhook if the daily pipeline script fails. Install Netdata (one-line installation) for a local CPU, RAM, and disk I/O dashboard that sends nothing to a third party.
apt update monthly during a low-usage window.If the VM runs out of memory during dbt transformations: vertical scaling takes two minutes (shut down, resize in the provider console, restart). If query performance degrades on large datasets: ensure dbt is running incremental models, not full refreshes, and move raw Parquet files from local disk to object storage for DuckDB to query remotely via HTTPFS extension. These are the two adjustments that resolve ninety percent of performance problems at scale.
Nauman Shahid builds zero-dependency data infrastructure for organisations in the UAE and the Gulf region. If your current data infrastructure cost or vendor exposure is a concern, diagnostic audit engagements are available through www.mindflex.tech.
Want the vendor lock-in diagnostic instrument?
The Vendor Lock-In Audit →These documents come from live diagnostic work. If your data infrastructure, vendor exposure, or compliance posture needs attention:
Discuss a diagnostic engagement →