Change Data Capture (CDC)¶
Fabricks provides built-in Change Data Capture (CDC) patterns to keep downstream tables synchronized with upstream changes using SQL-first pipelines. CDC is enabled per job via options.change_data_capture and implemented by generated MERGE/INSERT statements driven by small helper columns.
This page explains the supported CDC strategies, required inputs, merge semantics, and examples.
Strategies¶
| Strategy | Description | Convenience views |
|---|---|---|
nocdc |
No CDC; writes the result as-is. | ~ |
scd1 |
Tracks current vs deleted; maintains flags __is_current, __is_deleted. |
{table}__current in Silver |
scd2 |
Slowly Changing Dimension Type 2: validity windows with __valid_from, __valid_to. |
{table}__current in Silver |
What is SCD1?¶
- Definition: Slowly Changing Dimension Type 1 keeps only the current state of each business key. Attribute changes overwrite previous values rather than preserving history.
- Typical columns in Fabricks:
__is_current,__is_deleted. A convenience view{table}__currentin Silver selects current non-deleted rows. - When to use: When downstream consumers only need the
latestvalues and you do not need to answer “as-of†questions or audit historical attribute values.
What is SCD2?¶
- Definition: Slowly Changing Dimension Type 2 preserves the full change history by creating a new versioned row each time attributes change. Each row covers a validity window for a given business key.
- Typical columns in Fabricks:
__valid_from,__valid_to, and__is_current(optional__is_deletedif soft-deletes are modeled). Silver also provides{table}__currentforlatestrows. - When to use: When you must answer “as-of†queries (e.g., “What was the customer segment on 2024â€'03â€'01?â€), analyze changes over time, or maintain auditable history. In Gold SCD2 merges, inputs use
__key,__timestamp,__operationto define change points.
How to enable CDC¶
Set the CDC strategy in the job options:
- job:
step: silver
topic: demo
item: `scd1`
options:
mode: `update`
change_data_capture: `scd1`
parents: [bronze.demo_source]
For Gold:
Supported values: nocdc, scd1, scd2.
Input contracts¶
Some helper columns govern CDC behavior. Fabricks generates additional internal helpers during processing.
-
Gold jobs (consumer side):
scd2(required):__key,__timestamp,__operationwith values'upsert' | 'delete' | 'reload'.scd1(required):__key; optional__timestamp/__operation('upsert' | 'delete' | 'reload') for delete/rectify handling.- Note: If
__operationis absent in Gold SCDupdatejobs, Fabricks auto-injects__operation = 'reload'and enables rectification. - Optional helpers used by merges:
__order_duplicate_by_asc/__order_duplicate_by_desc__identity(only whentable_options.identityis not true; ifidentity: true, the identity column is auto-created and you should not supply__identity)__source(to scope merges by logical source)
-
Silver jobs (producer side):
- Provide business keys through job-level
keys(or compute a__key) to support downstream CDC. - Silver can apply CDC directly and yields convenience views (e.g.,
{table}__current).
- Provide business keys through job-level
Note
- Memory outputs ignore columns that start with
__. - Special characters in column names are preserved.
See details in:
- Steps: Silver | Gold
- Table Options
Merge semantics (under the hood)¶
Fabricks compiles CDC operations into SQL via Jinja templates at runtime. The core logic lives in fabricks.cdc:
Merger.get_merge_queryrenderstemplates/merge.sql.jinjafor the selectedchange_data_capturestrategy.- The framework computes internal columns such as:
__merge_condition~ one of'upsert' | 'delete' | 'update' | 'insert'depending on strategy and inputs.__merge_key~ a synthetic key used to join against the target.
- You usually do not set these internal fields manually; they are derived from your inputs (
__key,__operation,__timestamp) and job options.
Join keys
- If a
__keycolumn exists in the target, merges uset.__key = s.__merge_key. - Otherwise, the configured
keysoption is used to build an equality join on business keys.
Source scoping
- If
__sourceexists in both sides, merges addt.__source = s.__sourceto support multi-source data in the same table.
Soft delete vs hard delete (SCD1)
- If the incoming data contains
__is_deleted, the SCD1 template performs soft deletes:- Sets
__is_current = false,__is_deleted = trueon delete.
- Sets
- If
__is_deletedis absent, deletes are physical for SCD1.
Timestamps and metadata
- If the incoming data provides
__timestamp, it is propagated to the target. - If the target has
__metadata, theupdatedtimestamp is set to the current time during updates/deletes.
Identity and hash
- If
table_options.identity: true, the identity column is created automatically when the table is created. - If
table_options.identityis not true and__identityis present in the input, it will be written as a regular column. - If
__hashis present, it is updated during upsert operations.
Update filtering
options.update_wherecan constrain rows affected during merges (useful for limiting the scope of updates).
Internals reference
framework/fabricks/cdc/base/merger.py- Templates under
framework/fabricks/cdc/templates/merge/*.sql.jinja
SCD1 details¶
Behavior (see merge/scd1.sql.jinja)
- Upsert (
__merge_condition = 'upsert'): updates matching rows and inserts nonâ€'matching rows. - Delete (
__merge_condition = 'delete'):- Soft delete if
__is_deletedis part of the schema: sets__is_current = false,__is_deleted = true. - Otherwise, performs a physical delete.
- Soft delete if
Convenience view (in Silver)
{table}__current: filters current (nonâ€'deleted) rows for simplified consumption.
Minimal Silver example
Gold consumption example
-- Example: consuming current rows from SCD1 silver output
select id as id, name as monarch
from silver.monarch_scd1__current
Reload operation¶
'reload' is a CDC input operation used as a reconciliation boundary. It is not a direct MERGE action; instead, it signals Fabricks to rectify the target table using the supplied snapshot at that timestamp.
- Purpose: mark a full or partial snapshot boundary so missing keys can be treated as deletes and present keys as upserts as needed.
- Auto-injection: when a Gold SCD job runs in
mode:update` and your SQL does not provide__operation, Fabricks injects__operation = 'reload'` and turns on rectification. - Silver behavior:
- If a batch contains
'reload'after the target's max timestamp, Silver enables rectification logic. - In
mode:latest`,'reload'` is not allowed and will be rejected.
- If a batch contains
- Gold behavior:
- Passing
reload=Trueto a Gold job run triggers a fullcompletewrite for that run.
- Passing
- Internals: rectification is computed in
framework/fabricks/cdc/templates/query/rectify.sql.jinja, which computes next operations/windows around'reload'markers.
Warning
In Silver mode:latest, `'reload'` is forbidden and will be rejected.
Use `mode: `update or mode:append` instead if you need reconciliation behavior.
Tip
You generally do not need to emit 'reload' manually in Gold SCD update jobs; it is injected for you when __operation is missing.
For explicit control, you can produce rows with __operation = 'reload' at the snapshot timestamp.
SCD2 details¶
Behavior (see merge/scd2.sql.jinja):
- Update (
__merge_condition = 'update'): closes the current row by setting__valid_to = __valid_from - 1 second,__is_current = false. A subsequent insert creates the new current row. - Delete (
__merge_condition = 'delete'): closes the current row and sets__is_current = false(and__is_deleted = trueif soft delete is modeled). - Insert (
__merge_condition = 'insert'): inserts a new current row.
Required Gold inputs:
__key,__timestamp,__operationwith values'upsert' | 'delete' | 'reload'.
Reload notes:
'reload'marks a reconciliation boundary; Fabricks derives concrete actions (e.g., closing current rows, inserting new ones, deleting missing keys) across that boundary.- If you omit
__operationin Gold SCDupdatejobs, Fabricks injects'reload'and enables rectification automatically. - In Silver:
- Presence of
'reload'(beyond target's max timestamp) enables rectification. 'reload'is forbidden inmode:latest``.
- Presence of
Optional features:
options.correct_valid_from: adjusts start timestamps for validity windows.options.persist_last_timestamp: persists last processed timestamp for incremental loads.
Convenience view:
{table}__current: returns only thelatest(current) rows per business key.
Minimal Silver example:
Note: Credit ~ Temporal Snapshot Fact Table (SQLBits 2012). Recommended to watch to understand SCD2 snapshot-based modeling concepts.
Slides: Temporal Snapshot Fact Tables (slides)
Gold input construction example:
-- Turn SCD2 changes into Gold input operations
with dates as (
select id as id, __valid_from as __timestamp, 'upsert' as __operation
from silver.monarch_scd2 where __valid_from > '1900-01-02'
union
select id as id, __valid_to as __timestamp, 'delete' as __operation
from silver.monarch_scd2 where __is_deleted
)
select
d.id as __key,
s.id as id,
s.name as monarch,
s.doubleField as value,
d.__operation,
if(d.__operation = 'delete', d.__timestamp + interval 1 second, d.__timestamp) as __timestamp
from dates d
left join silver.monarch_scd2 s
on d.id = s.id and d.__timestamp between s.__valid_from and s.__valid_to
Keys and __key¶
- Preferred: produce a stable
__keyin your SELECTs (e.g., UDF that hashes business keys). - Alternative: configure
options.keys: [ ... ]to specify business keys. Fabricks derives join predicates from these when__keyis not present.
Tip
- Only provide
__identitywhentable_options.identityis not true. Ifidentity: true, the identity column is auto-created when the table is created; do not include__identity. See Table Options.
Examples¶
Silver SCD1 with duplicates handling
- job:
step: silver
topic: princess
item: order_duplicate
options:
mode: `update`
change_data_capture: `scd1`
order_duplicate_by:
order_by: desc
Gold SCD1 update with incremental timestamp
- job:
step: gold
topic: `scd1`
item: last_timestamp
options:
change_data_capture: `scd1`
mode: `update`
persist_last_timestamp: true
Operational notes¶
- Streaming: where supported,
options.stream: trueenables incremental semantics. - Parents: use
options.parentsto order upstream dependencies. - Checks: configure quality gates via
options.check_options. See Checks & Data Quality.
Related¶
- Steps: Silver | Gold
- Reference: Table Options, Extenders, UDFs & Parsers
- Sample runtime: Sample runtime