
January 29, 2026

Understanding Databricks costs through System Tables

Databricks cost analysis must start from billable usage: the raw, metered consumption recorded authoritatively in System Tables. Databricks never stores cost directly; pricing is always applied later by joining usage with list prices. That makes System Tables the only reliable source of truth for building accurate, auditable cost dashboards. Other tables (workspaces, clusters, jobs, pipelines) serve only to enrich usage with context; they do not generate cost.

If you’ve ever tried to answer questions like:

  • Why did this workspace suddenly spike in cost?
  • Which jobs are actually driving our Databricks spend?
  • Where is the money going if clusters are mostly idle?

and ended up with partial or conflicting answers, you’re not alone.

In Databricks, costs do not live where you intuitively expect them to.
They don’t live in clusters.
They don’t live in jobs.
They don’t live in pipelines.

And if your analysis starts there, it will be wrong.

The only place where Databricks billing truly starts is System Tables.

This post explains what they are, why they are the source of truth, and how to use them correctly to build reliable, auditable, and scalable cost dashboards.


The Most Common Mistake: Calculating Costs from the Wrong Place

Many teams begin their cost analysis by looking at:

  • Clusters
  • Jobs
  • Pipelines
  • DBUs reported at the workload level

This approach does not work.

Databricks does not bill clusters, jobs, or pipelines.
Databricks bills usage.

Every dollar you are charged by Databricks originates from a single table:

system.billing.usage

If your analysis does not start there, you are not analyzing real Databricks costs.


What “Billable Usage” Actually Means

Billable usage is the raw, metered record of everything Databricks decides to charge for.

Each record includes:

  • A SKU (sku_name)
  • A usage quantity (usage_quantity)
  • A time window
  • Optional workload metadata (job, cluster, pipeline, warehouse)

Important:

  • This is not cost
  • There is no currency
  • No pricing is applied yet

Usage is just measurement.
Cost is always derived later.
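
A quick way to internalize this is to look at a few raw records. A minimal sketch in Databricks SQL (column names follow the documented system.billing.usage schema; verify them against your account):

SELECT
  usage_date,
  sku_name,
  usage_unit,
  usage_quantity,            -- a metered amount, not money
  usage_metadata.job_id,     -- optional workload context
  usage_metadata.cluster_id
FROM system.billing.usage
ORDER BY usage_date DESC
LIMIT 10;

Notice that nothing in the result carries a price or a currency. Cost only appears once pricing is joined in.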


Why System Tables Are the Source of Truth

Databricks System Tables are Databricks-managed, read-only Delta tables that centralize:

  • Usage
  • Pricing
  • Operational metadata
  • Workspace context

They live in the system catalog and support account-wide analysis, not just per-workspace views.

This design enables a critical principle:

Usage, pricing, and context are stored separately

That separation is intentional—and it is what makes accurate cost attribution possible.


The Correct Mental Model: One Base Table, Everything Else Is Enrichment

This distinction between base and enrichment is the crux of the whole model.

✅ Base Table (the only source of cost)

  • system.billing.usage

Every cost metric originates here.

➕ Enrichment Tables (context only)

These tables do not generate cost.

  • Pricing

    • system.billing.list_prices

  • Workspace context

    • system.access.workspaces_latest

  • Compute metadata

    • system.compute.clusters

    • system.compute.warehouses

  • Workload metadata

    • system.lakeflow.jobs

    • system.lakeflow.pipelines

Think of it as a hub-and-spoke model:

Usage is the hub. Everything else adds meaning.


How Cost Is Actually Calculated

There is no magic.

cost = usage_quantity × effective_price

The effective price comes from system.billing.list_prices, matched by:

  • SKU
  • Usage unit
  • Cloud provider
  • Effective pricing time window

No shortcuts.
No hidden columns with final cost values.

That explicitness is a feature, not a limitation.
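
In practice the derivation is a single join. A minimal sketch (the pricing struct and time-window predicates follow Databricks’ documented pattern; treat the exact column names as something to verify for your account):

SELECT
  u.usage_date,
  u.sku_name,
  u.usage_quantity,
  u.usage_quantity * lp.pricing.default AS list_cost   -- cost is derived, never stored
FROM system.billing.usage AS u
LEFT JOIN system.billing.list_prices AS lp
  ON  u.sku_name   = lp.sku_name
  AND u.cloud      = lp.cloud
  AND u.usage_unit = lp.usage_unit
  AND u.usage_end_time >= lp.price_start_time
  AND (u.usage_end_time <= lp.price_end_time OR lp.price_end_time IS NULL);

The LEFT JOIN is deliberate, as the next section explains: a missing price row should surface as a NULL cost, not silently drop the usage record. Also note that system.billing.list_prices holds list prices, so negotiated discounts, if any, must be applied on top.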


Join Patterns That Work (and Those That Don’t)

Every reliable cost dashboard follows the same rules:

1. Usage → Pricing

Always use LEFT JOIN.

Missing pricing should never drop usage records.

2. Usage → Workspace

Used to report costs by meaningful names, not opaque IDs.

3. Usage → Clusters / Jobs / Pipelines

Used only for attribution, never for cost generation.

Additional realities:

  • Many system tables are SCD Type 2

  • You must consciously choose latest snapshot vs historical state

Ignoring this leads to misattributed costs.
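
For example, attributing usage to a cluster’s current name requires the latest version of each cluster record. A minimal sketch, assuming system.compute.clusters versions its rows via the change_time column in the documented schema:

-- Latest snapshot: keep only the newest version of each cluster
SELECT *
FROM system.compute.clusters
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY workspace_id, cluster_id
  ORDER BY change_time DESC
) = 1;

For point-in-time attribution you would instead join usage to the cluster version that was current during the usage window.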


What This Model Enables You to Answer

With this approach, you can reliably answer questions such as:

  • Cost per workspace (account-wide)
  • Daily, weekly, and monthly cost trends
  • Most expensive jobs
  • Cost by pipeline
  • SQL vs Jobs vs DLT
  • Cost by cluster policy
  • Cost by SKU
  • Cost by tags (when governance is in place)

All aligned with how Databricks actually bills you.
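
For instance, the first two questions on that list reduce to one query: usage as the hub, pricing and workspace names as spokes. A sketch (workspace_name is assumed to come from system.access.workspaces_latest; verify against your schema):

SELECT
  w.workspace_name,
  u.usage_date,
  SUM(u.usage_quantity * lp.pricing.default) AS daily_list_cost
FROM system.billing.usage AS u
LEFT JOIN system.billing.list_prices AS lp
  ON  u.sku_name   = lp.sku_name
  AND u.cloud      = lp.cloud
  AND u.usage_unit = lp.usage_unit
  AND u.usage_end_time >= lp.price_start_time
  AND (u.usage_end_time <= lp.price_end_time OR lp.price_end_time IS NULL)
LEFT JOIN system.access.workspaces_latest AS w
  ON u.workspace_id = w.workspace_id
GROUP BY w.workspace_name, u.usage_date
ORDER BY daily_list_cost DESC;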


Limitations You Must Acknowledge

1. Storage Costs Are Incomplete

System tables include only the limited storage usage that Databricks itself meters.

Most storage costs are billed directly by the cloud provider (S3, ADLS, GCS) and do not appear here.

Implication:
Databricks cost ≠ total cloud cost.

Any dashboard that ignores this is incomplete.


2. Serverless Breaks Cluster Attribution

Serverless workloads often have:

  • No cluster_id
  • Sometimes no warehouse_id

This makes cluster-level attribution impossible in some cases.

Correct approach for serverless:

  • Use billing_origin_product
  • Use job or pipeline identifiers
  • Rely on tags and governance metadata

Not clusters.
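
A minimal sketch of product-level attribution for records that lack a cluster (billing_origin_product is a documented column on system.billing.usage; the exact set of values may vary by release):

SELECT
  u.billing_origin_product,                -- e.g. JOBS, DLT, SQL
  COALESCE(u.usage_metadata.job_id,
           u.usage_metadata.dlt_pipeline_id,
           'unattributed') AS workload_id,
  SUM(u.usage_quantity) AS total_usage
FROM system.billing.usage AS u
WHERE u.usage_metadata.cluster_id IS NULL  -- no cluster to attribute to
GROUP BY ALL
ORDER BY total_usage DESC;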


So What Are System Tables Really For?

They are not “just another data source”.

They are the only reliable foundation for:

  • Understanding what you are being charged for
  • Explaining costs to stakeholders
  • Implementing chargeback and showback models
  • Detecting anomalies early
  • Making informed optimization decisions

If you have:

  • Multiple workspaces
  • Multiple teams
  • Any form of cost governance requirement

System Tables are not optional.


Final Takeaway

If you remember one thing, make it this:

In Databricks, costs are not calculated from workloads. They are calculated from usage.

System Tables are not a convenience feature. They are the source of truth. Everything else is just context.

By Santiago Fernández and Aldis Stareczek

Data Engineer at Qubika and Solutions Engineer & Databricks Champion

Santiago Fernández is a Data Engineer at Qubika with a strong focus on building scalable, well-governed data platforms on Databricks. His work has strengthened his expertise in designing reliable data pipelines, applying lakehouse best practices, and enforcing governance standards that enable consistency, traceability, and operational efficiency across data environments.

Aldis Stareczek Ferrari is a Senior Data Analyst and Databricks Champion at Qubika, specializing in lakehouse architectures, data pipelines, and governance with Unity Catalog. She combines strong business understanding with deep technical expertise to design high-quality, scalable data solutions aligned with real business needs. She leads Qubika’s Databricks community initiatives, organizing meetups and tours, publishing technical guidance and reference architectures, managing Qubika’s Databricks Reddit presence, and overseeing more than 200 Databricks-certified engineers to keep credentials current and continuously strengthen Qubika’s partner status. Credentials: M.Sc. in Data Science (UTEC) and Food Engineer (Universidad de la República).
