From Cloud-First to Local-First in Data
Over the course of my career in data, I’ve experienced firsthand how most data development is tied to cloud infrastructure. Whether remoted into virtual machines or connected to cloud data platforms via a browser UI or local IDE, nearly all of my work has been done in remote environments.
While powerful, these setups often slow down iteration, introduce dependence on network stability, and add overhead to even simple development tasks. Interestingly, even major cloud providers recognize the need for local-first workflows: they offer tools to emulate object storage and simulate serverless functions on a developer's machine precisely so that software can be developed locally.
Yet in data engineering, local development practices remain uncommon despite having many of the same opportunities as in software development.
Decoupling Storage and Compute: A New Opportunity
A major enabler of local-first data development is the decoupling of storage and compute. Thanks to modern data lake architectures and the widespread adoption of open-source file formats like Parquet and open table formats such as Delta Lake, Apache Iceberg, and Apache Hudi, organizations are no longer locked into a single data processing engine.
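To make the decoupling concrete, here is a minimal sketch of two different engines reading the same table. The local Delta table path (`./lakehouse/sales`) and its columns are hypothetical, and the snippet assumes the `polars`, `deltalake`, and `duckdb` Python packages plus a recent DuckDB with the delta extension are available.

```python
import duckdb
import polars as pl

table_path = "./lakehouse/sales"  # hypothetical local Delta Lake table

# Engine 1: Polars lazily scans the Delta table and aggregates it.
revenue_pl = (
    pl.scan_delta(table_path)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("revenue"))
    .collect()
)

# Engine 2: DuckDB reads the very same files through its delta extension.
con = duckdb.connect()
con.execute("INSTALL delta;")
con.execute("LOAD delta;")
revenue_duck = con.execute(
    f"SELECT region, SUM(amount) AS revenue FROM delta_scan('{table_path}') GROUP BY region"
).fetchall()

# Same storage, two interchangeable compute engines.
print(revenue_pl, revenue_duck)
```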
Cloud-First Inefficiencies
In software engineering, it’s standard practice to develop locally. Web developers spin up servers, databases, and apps on their local computer. They iterate rapidly, with full control, then push code to remote repositories and deploy to shared environments. This is efficient, cost-effective, and fast.
In contrast, data engineering and analytics workflows often begin directly in the cloud, using platforms like BigQuery, Databricks, Snowflake, Redshift, or Microsoft Fabric. This cloud-first mindset can introduce significant inefficiencies:
- Developers interact with production-scale environments even though they typically work with only small subsets of data at a given time.
- Cloud-based development environments often run with low utilization, leading to unnecessary costs.
- Developer laptops sit idle while cloud compute accumulates charges.
The Case for Local Data Development
Enabling Local Development for Data Engineers
What if data engineers could work more like software developers?
Tools like DuckDB and Polars make this possible. They enable high-performance analytics directly on a developer’s machine and integrate well with modern transformation frameworks like dbt and SQLMesh. This shift supports:
- Local SQL and DataFrame development with fast feedback loops
- Shorter development cycles and faster iteration
- Reduced reliance on always-on cloud infrastructure
- Lower barriers for onboarding new team members
While data engineers still use development, staging, and testing environments, the ability to build and test data models and pipelines locally means fewer dependencies on shared or cloud-based resources during early development. It also creates a more focused, efficient workflow.
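As a rough illustration of that loop, the snippet below develops a staging transformation against a small local sample. The file path, column names, and database name are all hypothetical; it only assumes the `duckdb` package and a sampled Parquet extract on disk.

```python
import duckdb

con = duckdb.connect("dev.duckdb")  # a throwaway local database file

# Iterate on a transformation against the local sample; re-running this is
# effectively instant, so the edit-run-inspect loop stays on the laptop.
con.execute("""
    CREATE OR REPLACE TABLE stg_orders AS
    SELECT
        order_id,
        customer_id,
        CAST(order_ts AS DATE) AS order_date,
        amount
    FROM read_parquet('./dev_data/orders.parquet')
    WHERE status = 'completed'
""")

print(con.execute("SELECT COUNT(*) FROM stg_orders").fetchone())
```

Frameworks like dbt and SQLMesh can sit on top of the same engine, but the core of the feedback loop is exactly this: an in-process database, a local file, and a re-run that takes seconds.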
What About Spark?
Apache Spark is undeniably popular, and for good reason. It’s proven, battle-tested, and highly scalable. For large-scale batch processing, distributed ETL jobs, and long-running analytics pipelines that operate on terabytes or petabytes of data, Spark continues to be an essential tool in many data stacks.
However, its popularity has also made it something of a default choice, even in scenarios where it’s not strictly necessary. Many teams and organizations reach for Spark out of habit or perceived industry standardization, despite working with data volumes that could be processed more efficiently using simpler, single-node tools.
This isn’t an argument against Spark, but rather a reminder that one size does not fit all. Spark absolutely has its place, but we should evaluate it on a case-by-case basis rather than assuming it’s always the right tool. Lightweight, local-first alternatives like DuckDB and Polars offer substantial performance and cost benefits for a large portion of modern data workloads, especially during development or when working with moderate-scale data.
Open-source Apache Spark can also be run locally, but it tends to be more cumbersome to set up and manage. Its distributed nature introduces unnecessary complexity for small-scale development and often requires additional configuration. The major managed platforms offering Spark also rely on their own closed-source optimizations, which makes them meaningfully different from the open-source version. In contrast, DuckDB and Polars offer a far simpler and more accessible developer experience while still delivering excellent performance on typical dev-scale data.
Meet the Tools: DuckDB and Polars
DuckDB: Embedded Analytics Powerhouse
DuckDB is a lightweight, embedded analytical SQL engine. It supports direct querying of local files like CSV and Parquet, integrates with both SQL and DataFrame paradigms, and runs entirely in-process—no server, no connection strings, no client setup. For small to medium-sized datasets, it’s remarkably fast.
DuckDB is not just fast and portable; it’s also highly extensible. Its plugin architecture allows developers to expand its capabilities with minimal overhead. Extensions bring support for advanced features like Iceberg REST Catalogs, Databricks Unity Catalog, and Vector Search, enabling integration with modern data lake ecosystems and AI/LLM workflows. This flexibility positions DuckDB as a serious candidate not only for analytics, but also for search and inference scenarios in intelligent applications.
Its MIT license and portable design make it a flexible Swiss army knife for embedded analytics in dashboards, ETL scripts, and cloud functions. Benchmarks show DuckDB can be significantly more cost-effective than Spark for many workloads.
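As a small example of that extensibility, the sketch below loads the httpfs extension and queries Parquet files directly from object storage. The bucket and paths are made up, and credentials are assumed to be configured separately (for example via DuckDB's secrets manager).

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# S3 credentials are assumed to be configured separately (e.g. via DuckDB's
# secrets manager or session settings); the bucket below is hypothetical.
rows = con.execute("""
    SELECT event_type, COUNT(*) AS events
    FROM read_parquet('s3://example-data-lake/events/*.parquet')
    GROUP BY event_type
    ORDER BY events DESC
""").fetchall()
print(rows)
```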
Polars: Fast, Scalable, Pythonic
Polars is a DataFrame library written in Rust, optimized for single-machine, multi-threaded performance. It offers a more expressive API than pandas and, in many cases, dramatically better performance. Polars can efficiently process gigabytes of data on a single machine and run seamlessly in environments like Fabric notebooks, Kubernetes, and Jupyter Notebooks.
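Here is a minimal sketch of what that looks like in practice: a lazy Polars pipeline over a local Parquet file. The file path and columns are hypothetical; the point is that the query is optimized as a whole and executed across all available cores only when `.collect()` is called.

```python
import polars as pl

daily_revenue = (
    pl.scan_parquet("./dev_data/orders.parquet")  # hypothetical local sample
    .filter(pl.col("status") == "completed")
    .with_columns(pl.col("order_ts").dt.date().alias("order_date"))
    .group_by("order_date")
    .agg(pl.col("amount").sum().alias("revenue"))
    .sort("order_date")
    .collect()  # the optimized plan runs here, multi-threaded
)
print(daily_revenue.head())
```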
Polars Cloud (currently in closed beta) brings serverless, distributed compute to the same Polars interface—allowing seamless horizontal scaling with CPU/GPU optimization. It supports usage-based pricing and automatic resource management.
Beyond Performance: Portability and Serverless Execution
These tools aren’t just efficient; they’re portable. Both DuckDB and Polars can be packaged and deployed on serverless cloud services such as AWS Lambda, Azure Functions, or GCP Cloud Run. They can interact with your data lake and even run in the browser. This allows on-demand execution with no infrastructure management.
These capabilities are ideal for ELT workloads where you only pay for compute when it’s used, or for web apps with embedded analytical functionality. This model particularly excels in the Load and Transform stages, turning raw data into structured layers such as the Bronze, Silver, and Gold layers of the popular Medallion Architecture.
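To illustrate, here is a hedged sketch of such a serverless transformation step: an AWS Lambda-style handler that uses DuckDB to turn raw Bronze files into a cleaned Silver dataset. The bucket names, columns, and layout are hypothetical, and it assumes the `duckdb` package is bundled with the function and that S3 credentials are configured for DuckDB.

```python
import duckdb

def handler(event, context):
    """Hypothetical Lambda entry point: Bronze -> Silver on each invocation."""
    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    # Read raw Bronze Parquet files from the lake, clean them, and write the
    # result back as a Silver dataset (all names below are illustrative).
    con.execute("""
        COPY (
            SELECT
                order_id,
                customer_id,
                CAST(order_ts AS DATE) AS order_date,
                amount
            FROM read_parquet('s3://example-lake/bronze/orders/*.parquet')
            WHERE status = 'completed'
        )
        TO 's3://example-lake/silver/orders.parquet' (FORMAT parquet)
    """)
    return {"status": "ok"}
```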
A Modern Development Pattern for Data Engineering
Current Workflow Challenges
Current Pattern:
- Write transformations directly in a cloud warehouse or distributed platform
- Pay per use (data scanned, compute time)
- Maintain dev infra with often low resource utilization
- Face slow feedback loops
The Local-First Alternative
Local-First Pattern:
- Develop models locally using DuckDB or Polars, potentially with frameworks like dbt and SQLMesh (a minimal sketch of this flow follows the list)
- Pull a subset of data from cloud storage once, then run transformations instantly, offline, and at no extra cost
- Transition seamlessly to cloud platforms for production
- Use serverless for burst workloads with efficient pay-per-use pricing
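The snippet below sketches what this pattern can look like in plain Python, without committing to any specific framework: the same transformation SQL runs against DuckDB during development and is handed to a cloud warehouse in production. The table names, environment variable, and `submit_to_warehouse` helper are placeholders, not real APIs.

```python
import os
import duckdb

TRANSFORM_SQL = """
    SELECT customer_id, SUM(amount) AS lifetime_value
    FROM {orders}
    GROUP BY customer_id
"""

def run_transform():
    if os.getenv("ENV", "dev") == "dev":
        # Local development: instant feedback against a sampled Parquet file.
        con = duckdb.connect()
        sql = TRANSFORM_SQL.format(orders="read_parquet('./dev_data/orders.parquet')")
        return con.execute(sql).fetchall()
    else:
        # Production: hand the same SQL to the cloud platform's own client
        # (details vary by vendor and are omitted; this helper is hypothetical).
        sql = TRANSFORM_SQL.format(orders="analytics.orders")
        return submit_to_warehouse(sql)
```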
Why This Matters: Optimizing Developer Experience and TCO
Even though production data volumes might be large, most developers work with small subsets at a time. Local-first tools empower them to iterate faster without touching cloud infra until it’s time to scale.
In addition, many organizations don’t require horizontally distributed platforms for multi-terabyte or petabyte workloads. This makes local and serverless tools more than viable—they’re optimal.
Reducing Total Cost of Ownership (TCO)
Adopting a local-first approach helps reduce TCO not just by cutting cloud costs, but by:
- Lowering time spent debugging cloud configs and environments
- Simplifying onboarding for new team members
- Reducing engineering overhead for infra setup and maintenance
For small-to-medium enterprises (SMEs), these savings are especially impactful—the cost of engineering time and operational complexity often outweighs raw infrastructure spend. A shorter feedback loop, simplified toolchain, and reduced cognitive load all translate to more productive teams and faster delivery.
What’s Coming Next
This is the first post in a series introducing how modern tools like DuckDB, Polars, dbt, and SQLMesh can improve developer experience, accelerate iteration cycles, and reduce cloud spend.
Next Post: We’ll walk through a practical demo: running DuckDB transformations locally, then deploying the same code to a cloud platform.
Closing Thought: Reclaiming Local in the Age of Cloud
Software engineers have long enjoyed the speed and efficiency of local development. With tools like DuckDB and Polars, it’s time data teams did the same.
In a time of budget scrutiny and performance focus, the potential benefits of shorter feedback loops, simpler infrastructure footprints, and reduced TCO are too great to ignore.