GitOps for Apache Superset

certifying datasets via GitHub Actions

How we treat Superset datasets as code: a GitHub-as-authority workflow with schema validation, Jinja parsing, SQLFluff, and a certification badge that points back to the exact CI run.

By drafted.work · Operational data team

In a modern data stack, "dashboard drift" — the phenomenon where metrics in your BI tool slowly diverge from the source of truth — is a constant threat. In distributed teams spanning multiple time zones, manual edits inside the Apache Superset UI quickly lead to inconsistent metrics, broken charts, and a lack of auditability.

To solve this, we implemented a GitOps workflow for Superset. By treating datasets as code, we've established a single source of truth, automated quality gates, and a visible certification process that keeps production dashboards reliable.

1. The core philosophy: Git as the authority

Our transition to GitOps was driven by five core requirements:

  • Single source of truth — the GitHub repository is the authoritative source for all production datasets.
  • Version control & audit trail — every change to a column, metric, or SQL query is tracked via Pull Requests.
  • Automated quality gates — no "breaking" SQL or invalid Jinja can hit production.
  • Visible certification — a badge in the Superset UI links directly to the last successful GitHub Actions run.
  • UI locking — once certified via Git, the dataset is locked in the UI to prevent manual overrides.

2. Repository architecture

We mirror the Superset dataset structure inside our repository. Each dataset is stored in a folder named after its dataset_id.

The file structure

datasets/
└── 34/
    ├── dataset.json              # Core metadata (ID, database, schema, table name)
    ├── columns.json              # Physical and basic virtual columns
    ├── calculated_columns.json   # Complex virtual columns with logic
    ├── metrics.json              # Saved metrics (SQL expressions and labels)
    └── sql.sql                   # Source SQL for virtual datasets (Jinja supported)

Example: dataset.json

{
  "dataset_id": 34,
  "description": "Core store reference table",
  "database_id": 14,
  "schema": "b2b",
  "table_name": "All_stores_summary"
}

3. The four-step validation pipeline

Every Pull Request triggers a GitHub Actions workflow. To keep feedback fast, we use git diff to identify and validate only the changed dataset folders.
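The selective approach can be sketched in a few lines of Python. This assumes the CI step feeds in the output of something like `git diff --name-only origin/main...HEAD` (the exact git invocation and the `datasets/<id>/` layout from section 2 are the only assumptions):

```python
from pathlib import PurePosixPath

def changed_dataset_ids(changed_paths):
    """Map changed file paths (from `git diff --name-only`) to dataset folder IDs."""
    ids = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Only paths of the form datasets/<id>/<file> are relevant to validation.
        if len(parts) >= 3 and parts[0] == "datasets" and parts[1].isdigit():
            ids.add(int(parts[1]))
    return sorted(ids)

# Example: a diff touching two datasets and an unrelated file.
diff = [
    "datasets/34/metrics.json",
    "datasets/34/sql.sql",
    "datasets/57/columns.json",
    "README.md",
]
print(changed_dataset_ids(diff))  # → [34, 57]
```

Only the folders returned here go through the four steps below, which is what keeps feedback times low.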

Step 1 — Structure validation

The system verifies that the folder name matches the internal dataset_id and checks that all mandatory files exist. It then performs JSON Schema validation to ensure field types and lengths are correct before any call to the Superset API is attempted.
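A stdlib-only sketch of this step. The required-file set and the field spec here stand in for our full JSON Schema and are assumptions for illustration:

```python
import json
from pathlib import Path

REQUIRED_FILES = {"dataset.json", "columns.json", "metrics.json"}

# Minimal field spec standing in for a full JSON Schema (illustrative).
DATASET_FIELDS = {
    "dataset_id": int,
    "database_id": int,
    "schema": str,
    "table_name": str,
}

def validate_structure(folder: Path):
    """Return a list of structural problems for one dataset folder."""
    errors = []
    present = {p.name for p in folder.iterdir()}
    missing = REQUIRED_FILES - present
    if missing:
        errors.append(f"missing files: {sorted(missing)}")
    if "dataset.json" in missing:
        return errors  # cannot check metadata without the core file
    meta = json.loads((folder / "dataset.json").read_text())
    for field, ftype in DATASET_FIELDS.items():
        if not isinstance(meta.get(field), ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
    # The folder name must match the dataset_id inside the metadata.
    if str(meta.get("dataset_id")) != folder.name:
        errors.append(f"folder name {folder.name!r} != dataset_id {meta.get('dataset_id')}")
    return errors
```

An empty list means the folder is safe to hand to the next step.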

Step 2 — Cross-reference validation

This step ensures internal consistency within the dataset files:

  • Do metric expressions reference columns that actually exist in columns.json?
  • Does the table_name in the metadata align with the references in the SQL file?
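The first of these checks can be approximated with plain Python. The `column_name` / `metric_name` / `expression` keys mirror Superset's metadata shape, but the identifier extraction here is a deliberately simplified regex sketch, not our production parser:

```python
import json
import re

# Small allow-list so SQL keywords and functions aren't mistaken for columns.
SQL_WORDS = {"sum", "count", "avg", "min", "max", "case", "when", "then",
             "else", "end", "distinct", "coalesce", "nullif", "is", "null", "not"}

def check_metric_references(metrics_json, columns_json):
    """Flag metrics whose expressions reference columns absent from columns.json."""
    known = {c["column_name"] for c in json.loads(columns_json)}
    problems = []
    for metric in json.loads(metrics_json):
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", metric["expression"]))
        unknown = {t for t in tokens if t.lower() not in SQL_WORDS} - known
        if unknown:
            problems.append(f"{metric['metric_name']}: unknown columns {sorted(unknown)}")
    return problems

columns = json.dumps([{"column_name": "revenue"}, {"column_name": "store_id"}])
metrics = json.dumps([{"metric_name": "total_revenue", "expression": "SUM(revenue)"},
                      {"metric_name": "orders", "expression": "COUNT(order_id)"}])
print(check_metric_references(metrics, columns))  # → ["orders: unknown columns ['order_id']"]
```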

Step 3 — Jinja template parsing

Because Superset uses Jinja for dynamic filtering (for example {{ current_user_id() }}), we use the jinja2 Python library to parse the SQL. This catches syntax errors and nesting mistakes that would otherwise fail at runtime inside the dashboard.
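A minimal sketch of that parse check. `Environment().parse` only validates template syntax and never executes macros, so no Superset context (and no definition of `current_user_id`) is needed:

```python
from jinja2 import Environment, TemplateSyntaxError

def jinja_error(sql):
    """Return a parse-error message for the SQL's Jinja templating, or None if clean."""
    try:
        Environment().parse(sql)
        return None
    except TemplateSyntaxError as exc:
        return f"line {exc.lineno}: {exc.message}"

print(jinja_error("SELECT * FROM orders WHERE user_id = {{ current_user_id() }}"))  # → None
print(jinja_error("{% if region %}SELECT 1"))  # unclosed block → error message
```

Catching these in CI is much cheaper than discovering them as a runtime error inside a dashboard.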

Step 4 — SQL linting (SQLFluff)

We enforce a strict SQL style guide using SQLFluff with the Postgres dialect. Notable rules:

  • AM08 — no implicit cross-joins.
  • ST11 — no unused joins.
  • AM06 — no ambiguous column references.
  • Standardisation — no SELECT *, strict complexity limits.
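A minimal `.sqlfluff` sketch for this step. Rule codes are as of recent SQLFluff releases; a real setup would likely enable the full default rule set and exclude selectively rather than allow-list like this:

```ini
# .sqlfluff — illustrative configuration for the lint step above
[sqlfluff]
dialect = postgres
# Restrict CI to the rules called out in the text; AM04 (unknown result
# column count) is one way to flag SELECT * in the outermost query.
rules = AM04, AM06, AM08, ST11
max_line_length = 120
```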

4. UI integration & certification

When the pipeline passes, the deploy job fires a PATCH request to the Superset API. We lean on Superset's built-in certification metadata fields to close the loop:

{
  "certification": {
    "certified_by": "GitHub Action",
    "details": "https://github.com/org/repo/actions/runs/XXXXXX"
  }
}
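The deploy step can be sketched with the standard library alone. Superset stores dataset certification inside the `extra` metadata field, and the endpoint path follows its dataset REST API, but both should be verified against your Superset version; the host and token here are placeholders:

```python
import json
import urllib.request

SUPERSET_URL = "https://superset.example.com"  # placeholder host

def certification_request(dataset_id, run_url, token):
    """Build (but do not send) the PATCH that stamps a dataset as certified."""
    payload = {
        # Certification metadata lives inside the dataset's `extra` JSON field.
        "extra": json.dumps({
            "certification": {
                "certified_by": "GitHub Action",
                "details": run_url,
            }
        })
    }
    return urllib.request.Request(
        url=f"{SUPERSET_URL}/api/v1/dataset/{dataset_id}",
        data=json.dumps(payload).encode(),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = certification_request(34, "https://github.com/org/repo/actions/runs/123", "TOKEN")
print(req.method, req.full_url)
```

In the workflow, the request is sent with `urllib.request.urlopen(req)` (or your HTTP client of choice) only after every validation job has passed.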

The user experience

In our custom Superset fork, a certification badge appears on the dataset. When a user hovers over the badge, they see a direct link to the specific GitHub Action run that verified the code. The dataset also becomes read-only in the UI, forcing any further changes through the established PR process.

5. Wins & strategic outlook

Key wins

  • Trust — stakeholders can verify when and how a metric was last updated.
  • Stability — a 90% reduction in "broken dashboard" reports caused by manual metadata edits.
  • Speed — the selective CI approach gives analysts feedback in seconds.

Challenges & next steps

Adoption is our biggest hurdle. Moving from a GUI to a Git workflow is a cultural shift for many analysts. We are developing internal "UI helpers" to make the transition to Git as frictionless as possible.

Who should implement this?

This setup is ideal for self-hosted Superset instances and data-engineering teams that are drowning in toil. While we use a custom fork for UI locking, about 80% of this workflow can be achieved on vanilla Superset using the REST API and the standard metadata fields.

Also on X

A thread on the same workflow: x.com/RtKazakov.

Topics

  • Apache Superset
  • GitOps
  • GitHub Actions
  • SQLFluff
  • Data quality