ETL Pipeline Monitoring

free

Monitor data pipeline health — row counts, freshness, schema changes, volume anomalies, and ingestion delays.

18 rules · 2900 downloads · 4.0 avg (227 ratings)

etl, pipeline, monitoring, airflow, dbt, spark, observability, data-ops

Download & Install

Use the CLI:

$ npx dqhub install etl-pipeline-monitoring --format soda --table YOUR_TABLE

About this pack

Operational data quality checks for ETL/ELT pipeline monitoring. Covers:

- Volume: row count thresholds, growth rate monitoring, anomaly detection
- Freshness: table staleness, partition freshness, ingestion delay SLAs
- Schema: column count validation, expected columns present
- Statistical: distribution stability, null rate trending, cardinality drift
- Completeness: required field coverage after each load

Designed for data engineers monitoring Airflow, dbt, Spark, or any pipeline.

What's included

- 7 volume rules
- 4 freshness rules
- 4 statistical rules
- 3 completeness rules

Checks included (18)

Table Not Empty

Asserts that a table contains at least one row. This is the most fundamental volume check to confirm that a table has not been accidentally truncated, dropped, or failed to load any data.

Row Count Minimum

Asserts that a table contains at least the specified minimum number of rows. Useful for tables with known baseline volumes where dropping below a threshold indicates a data load issue.

Row Count Range

Asserts that the row count of a table falls within an expected minimum and maximum range. Catches both data loss (too few rows) and data duplication or explosion (too many rows).
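The three threshold checks above (not empty, minimum, range) reduce to one comparison against optional bounds. The sketch below is illustrative only and is not the pack's implementation; the function name `row_count_in_range` and its parameters are hypothetical.

```python
from typing import Optional


def row_count_in_range(row_count: int,
                       min_rows: int = 1,
                       max_rows: Optional[int] = None) -> bool:
    """Return True when row_count lies within [min_rows, max_rows].

    With the defaults (min_rows=1, no maximum) this behaves like a
    "table not empty" check. Names and defaults are assumptions.
    """
    if row_count < min_rows:
        return False  # too few rows: possible data loss or failed load
    if max_rows is not None and row_count > max_rows:
        return False  # too many rows: possible duplication or explosion
    return True
```

Calling `row_count_in_range(0)` models Table Not Empty, `row_count_in_range(n, min_rows=100)` models Row Count Minimum, and passing both bounds models Row Count Range.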

Row Count Growth

Asserts that the current row count has not decreased by more than the specified percentage compared to the previous run's row count. Detects accidental data loss, failed incremental loads, or unintended deletions between pipeline runs.
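As a rough sketch of this comparison, assuming the previous run's count has been stored somewhere the check can read it: the name `row_count_drop_ok` and the choice to pass when there is no usable previous count are assumptions, not behavior documented by the pack.

```python
def row_count_drop_ok(previous: int, current: int, max_drop_pct: float) -> bool:
    """Return True when the count has not dropped by more than max_drop_pct.

    Growth (current > previous) always passes, since only decreases
    are being checked here.
    """
    if previous <= 0:
        # Assumption: with no prior rows there is nothing to compare against.
        return True
    drop_pct = (previous - current) / previous * 100.0
    return drop_pct <= max_drop_pct
```

For example, going from 1000 to 950 rows is a 5% drop and passes a 10% limit, while going from 1000 to 800 rows is a 20% drop and fails it.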

Row Count Anomaly Detection

Asserts that the current row count is within the specified number of standard deviations from the historical average. Uses statistical anomaly detection to catch unexpected volume spikes or drops without requiring hard-coded thresholds.
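One common way to express "within N standard deviations of the historical average" is a z-score test. The following is a minimal sketch under that assumption; the pack may compute its baseline differently, and `row_count_is_anomalous` is a hypothetical name.

```python
from statistics import mean, stdev
from typing import Sequence


def row_count_is_anomalous(history: Sequence[int],
                           current: int,
                           max_sigma: float = 3.0) -> bool:
    """Flag current when it is more than max_sigma sample standard
    deviations away from the mean of history."""
    if len(history) < 2:
        raise ValueError("need at least two historical counts")
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # Perfectly flat history: treat any change at all as anomalous.
        return current != mu
    return abs(current - mu) / sigma > max_sigma
```

A short history makes the standard deviation unstable, so in practice a check like this is usually fed several weeks of prior run counts.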

Daily Volume Consistency

Asserts that daily row counts fall within an expected range. Identifies days with abnormally low or high data volumes that may indicate partial loads, duplicate ingestion, or upstream source issues.
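To illustrate the idea, the sketch below groups row timestamps by calendar day and reports the days whose counts fall outside the expected range. It assumes the timestamps are already in memory; `days_outside_range` is a hypothetical helper, not part of the pack.

```python
from collections import Counter
from datetime import date, datetime
from typing import Dict, Iterable


def days_outside_range(timestamps: Iterable[datetime],
                       min_rows: int,
                       max_rows: int) -> Dict[date, int]:
    """Return {day: row_count} for every day whose count is out of range.

    Note: a day with zero rows never appears in the input, so it is not
    reported here; detecting fully missing days needs a calendar to
    compare against.
    """
    counts = Counter(ts.date() for ts in timestamps)
    return {day: n for day, n in counts.items()
            if not (min_rows <= n <= max_rows)}
```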

Schema Column Count

Asserts that a table has the expected number of columns. Detects unintended schema changes such as dropped columns, added columns from upstream migrations, or schema drift between environments.

Table Freshness

Asserts that a table has been updated within the specified number of hours. Uses the table's metadata (last modified timestamp) or a designated timestamp column to verify data is fresh and pipelines are running on schedule.

Column Max Age

Asserts that the most recent value in a date/timestamp column is within the specified number of hours from the current time. Useful for verifying that new data is arriving as expected in date-partitioned or event-driven tables.
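A freshness check of this kind compares the newest timestamp with the current time. The sketch below assumes timezone-aware UTC timestamps and treats an empty column as stale; both choices, and the name `column_max_age_ok`, are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def column_max_age_ok(values: Iterable[datetime],
                      max_age_hours: float,
                      now: Optional[datetime] = None) -> bool:
    """Return True when the newest timestamp is at most max_age_hours old."""
    values = list(values)
    if not values:
        return False  # assumption: no data at all counts as stale
    now = now or datetime.now(timezone.utc)
    return now - max(values) <= timedelta(hours=max_age_hours)
```

Passing `now` explicitly keeps the check deterministic in tests; in a scheduled run it can be left unset.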

Partition Freshness

Asserts that the latest partition or date value in a partitioned table is within the expected range of the current date. Ensures that daily, hourly, or other periodic data loads are completing on schedule.
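For a daily-partitioned table, the check can be sketched as a lag in whole days between the latest partition date and today. The function name and the decision to reject future-dated partitions are assumptions of this example.

```python
from datetime import date


def partition_is_fresh(latest_partition: date,
                       today: date,
                       max_lag_days: int = 1) -> bool:
    """Return True when the latest partition is at most max_lag_days behind today.

    A partition dated in the future yields a negative lag and fails,
    since that usually points to a bad date value (an assumption here).
    """
    lag_days = (today - latest_partition).days
    return 0 <= lag_days <= max_lag_days
```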

Ingestion Delay

Asserts that the time difference between the source event timestamp and the load/ingestion timestamp is within the defined SLA. Detects pipeline lag, backpressure, or ingestion failures that cause data to arrive late.
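The delay here is simply the load timestamp minus the event timestamp, compared per row with the SLA. A minimal sketch, assuming each row is available as an `(event_ts, loaded_ts)` pair; the helper name `rows_breaching_sla` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Row = Tuple[datetime, datetime]  # (event_ts, loaded_ts)


def rows_breaching_sla(rows: List[Row], sla: timedelta) -> List[Row]:
    """Return the rows whose ingestion delay (loaded_ts - event_ts) exceeds sla."""
    return [(event_ts, loaded_ts)
            for event_ts, loaded_ts in rows
            if loaded_ts - event_ts > sla]
```

An empty result means every row arrived within the SLA; a non-empty one gives the late rows to investigate.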

Null Rate Stable

Asserts that the null rate of a column has not changed by more than the specified number of percentage points from a known baseline. Detects regressions in data completeness that may indicate broken upstream transformations, schema changes, or ETL failures.
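Because the tolerance is in percentage points, the comparison is an absolute difference between two rates rather than a relative change. The sketch below shows that distinction; `null_rate` and `null_rate_stable` are hypothetical names, and the baseline is assumed to be stored as a fraction between 0 and 1.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[object]]) -> float:
    """Fraction of values that are None, between 0.0 and 1.0."""
    if not values:
        raise ValueError("cannot compute a null rate for an empty column")
    return sum(v is None for v in values) / len(values)


def null_rate_stable(values: Sequence[Optional[object]],
                     baseline_rate: float,
                     max_change_pp: float) -> bool:
    """Return True when the null rate moved at most max_change_pp
    percentage points (absolute difference) from baseline_rate."""
    change_pp = abs(null_rate(values) - baseline_rate) * 100.0
    return change_pp <= max_change_pp
```

With a baseline of 0.15 and a current rate of 0.20, the change is about 5 percentage points, even though the relative increase is roughly 33%.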

Cardinality Check

Asserts that the number of distinct values in a column falls within an expected range. Detects issues such as collapsed categories (too few distinct values), data explosion (too many), or enum drift from upstream changes.
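A cardinality check amounts to counting distinct values and comparing the count with a range. In this sketch nulls are excluded from the count, mirroring how SQL `COUNT(DISTINCT ...)` behaves; that choice and the name `cardinality_in_range` are assumptions.

```python
from typing import Hashable, Iterable, Optional


def cardinality_in_range(values: Iterable[Optional[Hashable]],
                         min_distinct: int,
                         max_distinct: int) -> bool:
    """Return True when the number of distinct non-null values is in range."""
    distinct = len({v for v in values if v is not None})
    return min_distinct <= distinct <= max_distinct
```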

Mean In Range

Asserts that the arithmetic mean of a numeric column falls within an expected range. Detects data drift, calculation errors, or upstream changes that shift the central tendency of key metrics.

Standard Deviation Stable

Asserts that the standard deviation of a numeric column has not changed by more than the specified percentage from a known baseline. Detects changes in data variability that may indicate corrupted data, changed source systems, or process failures.
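Unlike the null rate check, this tolerance is a relative percentage of the baseline. A minimal sketch, assuming the sample standard deviation is the intended statistic (the pack does not say whether it uses sample or population); `stdev_stable` is a hypothetical name.

```python
from statistics import stdev
from typing import Sequence


def stdev_stable(values: Sequence[float],
                 baseline_stdev: float,
                 max_change_pct: float) -> bool:
    """Return True when the sample standard deviation of values differs
    from baseline_stdev by at most max_change_pct percent."""
    if baseline_stdev <= 0:
        raise ValueError("baseline_stdev must be positive")
    current = stdev(values)  # requires at least two values
    change_pct = abs(current - baseline_stdev) / baseline_stdev * 100.0
    return change_pct <= max_change_pct
```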

Required Columns Present

Asserts that a table contains all expected columns by name. Catches schema drift, missing columns after migrations, or upstream schema changes before downstream logic breaks.
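Checking columns by name is a set-membership test against the table's actual column list. The sketch below assumes that list has already been read from the warehouse's metadata; `missing_columns` is a hypothetical helper, and the comparison is case-sensitive.

```python
from typing import Iterable, List


def missing_columns(actual_columns: Iterable[str],
                    required_columns: Iterable[str]) -> List[str]:
    """Return the required column names absent from actual_columns,
    in the order they were required. Comparison is case-sensitive."""
    present = set(actual_columns)
    return [name for name in required_columns if name not in present]
```

An empty list means the check passes; otherwise the result names exactly which columns disappeared, which is more useful in an alert than a bare failure.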

Column Not Null

Asserts that a specified column contains no null values. This is the most fundamental completeness check — every row must have a value present in the target column.

Column Completeness Threshold

Asserts that a column meets a minimum completeness threshold, measured as the percentage of non-null values. Useful when some nulls are acceptable but the overall population rate must stay above a defined level (e.g., 95%).
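The two completeness checks above differ only in the threshold: Column Not Null is the 100% case. As an illustration, with hypothetical names and the assumption that an empty column counts as 0% complete:

```python
from typing import Iterable, Optional


def completeness_pct(values: Iterable[Optional[object]]) -> float:
    """Percentage of non-null values, from 0.0 to 100.0."""
    values = list(values)
    if not values:
        return 0.0  # assumption: an empty column is treated as 0% complete
    non_null = sum(1 for v in values if v is not None)
    return 100.0 * non_null / len(values)


def meets_completeness(values: Iterable[Optional[object]],
                       min_pct: float = 95.0) -> bool:
    """Return True when the column is at least min_pct percent populated."""
    return completeness_pct(values) >= min_pct
```

Setting `min_pct=100.0` reproduces the stricter no-nulls rule.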