GitOps for Apache Superset
Certifying datasets via GitHub Actions
How we treat Superset datasets as code: a GitHub-as-authority workflow with schema validation, Jinja parsing, SQLFluff, and a certification badge that points back to the exact CI run.
In a modern data stack, "dashboard drift" — the phenomenon where metrics in your BI tool slowly diverge from the source of truth — is a constant threat. In distributed teams spanning multiple time zones, manual edits inside the Apache Superset UI quickly lead to inconsistent metrics, broken charts, and a lack of auditability.
To solve this, we implemented a GitOps workflow for Superset. By treating datasets as code, we've established a single source of truth, automated quality gates, and a visible certification process that keeps production dashboards reliable.
1. The core philosophy: Git as the authority
Our transition to GitOps was driven by five core requirements:
- Single source of truth — the GitHub repository is the authoritative source for all production datasets.
- Version control & audit trail — every change to a column, metric, or SQL query is tracked via Pull Requests.
- Automated quality gates — no "breaking" SQL or invalid Jinja can hit production.
- Visible certification — a badge in the Superset UI links directly to the last successful GitHub Actions run.
- UI locking — once certified via Git, the dataset is locked in the UI to prevent manual overrides.
2. Repository architecture
We mirror the Superset dataset structure inside our repository. Each dataset is
stored in a folder named after its dataset_id.
The file structure

```text
datasets/
└── 34/
    ├── dataset.json             # Core metadata (ID, database, schema, table name)
    ├── columns.json             # Physical and basic virtual columns
    ├── calculated_columns.json  # Complex virtual columns with logic
    ├── metrics.json             # Saved metrics (SQL expressions and labels)
    └── sql.sql                  # Source SQL for virtual datasets (Jinja supported)
```

Example: dataset.json
```json
{
  "dataset_id": 34,
  "description": "Core store reference table",
  "database_id": 14,
  "schema": "b2b",
  "table_name": "All_stores_summary"
}
```

3. The four-step validation pipeline
Every Pull Request triggers a GitHub Actions workflow. To keep feedback fast,
we use git diff to identify and validate only the changed dataset folders.
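The selective approach can be sketched as a small helper that maps changed file paths to the dataset folders that need validation. The folder layout is the datasets/<id>/ structure shown above; the exact git invocation is an assumption about how the workflow is wired:

```python
import subprocess
from pathlib import PurePosixPath

def changed_dataset_ids(changed_paths):
    """Extract the set of dataset folder names touched by a diff."""
    ids = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Only paths like datasets/<id>/<file> belong to a dataset folder.
        if len(parts) >= 3 and parts[0] == "datasets":
            ids.add(parts[1])
    return ids

def diff_against_base(base_ref="origin/main"):
    """List files changed relative to the base branch (hypothetical wiring)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

In CI, changed_dataset_ids(diff_against_base()) yields exactly the folders to validate, so a PR that only touches the README skips dataset validation entirely.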
Step 1 — Structure validation
The system verifies that the folder name matches the internal dataset_id and
checks for the existence of mandatory files. It then performs JSON schema
validation to ensure field types and lengths are correct before any API
attempt.
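A simplified stand-in for this step (plain Python rather than a JSON Schema library, with an illustrative required-field list) might look like:

```python
import json
from pathlib import Path

# Hypothetical required fields and types for dataset.json.
REQUIRED_FIELDS = {
    "dataset_id": int,
    "database_id": int,
    "schema": str,
    "table_name": str,
}
MANDATORY_FILES = ["dataset.json", "columns.json", "metrics.json"]

def validate_structure(folder: Path):
    """Return a list of validation errors for one dataset folder."""
    errors = []
    for name in MANDATORY_FILES:
        if not (folder / name).is_file():
            errors.append(f"missing file: {name}")
    meta_path = folder / "dataset.json"
    if meta_path.is_file():
        meta = json.loads(meta_path.read_text())
        for field, kind in REQUIRED_FIELDS.items():
            if not isinstance(meta.get(field), kind):
                errors.append(f"{field}: expected {kind.__name__}")
        # The folder name must match the declared dataset_id.
        if str(meta.get("dataset_id")) != folder.name:
            errors.append("folder name does not match dataset_id")
    return errors
```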
Step 2 — Cross-reference validation
This step ensures internal consistency within the dataset files:
- Do metric expressions reference columns that actually exist in
columns.json? - Does the
table_namein the metadata align with the references in the SQL file?
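The metric-to-column check can be roughed out with a naive identifier scan over each metric expression; a real implementation would use a proper SQL parser, and the field names here mirror an assumed shape of metrics.json and columns.json:

```python
import re

def undefined_columns(metrics, columns):
    """Report identifiers used in metric expressions but absent from columns.

    `metrics` is a list of {"metric_name": ..., "expression": ...} dicts and
    `columns` a list of {"column_name": ...} dicts. Naive: treats every bare
    identifier that is not a known SQL keyword as a column reference.
    """
    known = {c["column_name"].lower() for c in columns}
    keywords = {"sum", "count", "avg", "min", "max", "case", "when", "then",
                "else", "end", "distinct", "is", "null", "not", "and", "or"}
    problems = {}
    for metric in metrics:
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", metric["expression"]))
        missing = {t for t in tokens if t.lower() not in known | keywords}
        if missing:
            problems[metric["metric_name"]] = sorted(missing)
    return problems
```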
Step 3 — Jinja template parsing
Because Superset uses Jinja for dynamic filtering (for example
{{ current_user_id() }}), we use the jinja2 Python library to parse the
SQL. This catches syntax errors and nesting mistakes that would otherwise fail
at runtime inside the dashboard.
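In essence this step boils down to calling jinja2's parser and surfacing any TemplateSyntaxError. Parsing alone catches malformed tags without needing the Superset macros (such as current_user_id) to be defined, since nothing is rendered:

```python
from jinja2 import Environment, TemplateSyntaxError

def check_jinja(sql: str):
    """Parse the templated SQL; return (True, None) or (False, message)."""
    env = Environment()
    try:
        # parse() builds the AST only -- macros like current_user_id()
        # do not have to exist at this stage.
        env.parse(sql)
        return True, None
    except TemplateSyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.message}"
```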
Step 4 — SQL linting (SQLFluff)
We enforce a strict SQL style guide using SQLFluff with the Postgres dialect. Notable rules:
- AM08 — no implicit cross-joins.
- ST11 — no unused joins.
- AM06 — no ambiguous column references.
- Standardisation — no SELECT *, strict complexity limits.
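A .sqlfluff fragment along these lines pins the dialect; the rule tuning shown is an illustrative sketch, not our exact configuration:

```ini
[sqlfluff]
dialect = postgres
# Keep the default rule set and layer project limits on top.
max_line_length = 120

[sqlfluff:rules:ambiguous.column_references]
# AM06: force a consistent (all-qualified or all-unqualified) style.
group_by_and_order_by_style = consistent
```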
4. UI integration & certification
When the pipeline passes, the deploy job fires a PATCH request to the
Superset API. We lean on Superset's built-in certification metadata fields to
close the loop:
```json
{
  "certification": {
    "certified_by": "GitHub Action",
    "details": "https://github.com/org/repo/actions/runs/XXXXXX"
  }
}
```

The user experience
In our custom Superset fork, a certification badge appears on the dataset. When a user hovers over the badge, they see a direct link to the specific GitHub Action run that verified the code. The dataset also becomes read-only in the UI, forcing any further changes through the established PR process.
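The deploy step can be sketched as follows. The host, auth handling, and the exact field carrying the certification blob are assumptions to adapt to your Superset version (Superset commonly keeps certification metadata as a JSON string in the dataset's extra field):

```python
import json

SUPERSET_URL = "https://superset.example.com"  # hypothetical host

def certification_payload(run_url: str) -> dict:
    """Build the dataset-update body that marks the dataset certified."""
    extra = {
        "certification": {
            "certified_by": "GitHub Action",
            "details": run_url,  # deep link to the exact CI run
        }
    }
    # Certification metadata travels as a JSON string in `extra`.
    return {"extra": json.dumps(extra)}

# In the deploy job (needs an API token; shown commented to stay self-contained):
# import requests
# requests.patch(
#     f"{SUPERSET_URL}/api/v1/dataset/34",
#     json=certification_payload("https://github.com/org/repo/actions/runs/XXXXXX"),
#     headers={"Authorization": f"Bearer {token}"},
#     timeout=30,
# )
```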
5. Wins & strategic outlook
Key wins
- Trust — stakeholders can verify when and how a metric was last updated.
- Stability — a 90% reduction in "broken dashboard" reports caused by manual metadata edits.
- Speed — the selective CI approach gives analysts feedback in seconds.
Challenges & next steps
Adoption is our biggest hurdle. Moving from a GUI to a Git workflow is a cultural shift for many analysts. We are developing internal "UI helpers" to make the transition to Git as frictionless as possible.
Who should implement this?
This setup is ideal for self-hosted Superset instances and data-engineering teams that are drowning in toil. While we use a custom fork for UI locking, about 80% of this workflow can be achieved on vanilla Superset using the REST API and the standard metadata fields.
Also on X
A thread on the same workflow: x.com/RtKazakov.