My experience with data backfilling and achieving reproducibility

Recently, I was busy with some data engineering related things and I want to share some insights.

Make jobs resilient to failure

So you wrote following job that runs daily

SELECT *
FROM item_events
WHERE item_events.item_id in (
    SELECT id
    FROM item
    WHERE item.status='ACTIVE'
) AND item_events.date = {DATE}

and deployed to production using your workflow management tool(ex. Airflow). Some time later you notice that logic is wrong and have to you run job again for last month.

The thing is that can wrong is that you depend on item.status. Some items can have different statuses right now than month ago. On job rerun, events for items that were active in the previous month but are no longer active will not be saved.

You must remember that jobs will always fail for some reason. What is important is that how can you achieve correctness of backfilled data.

There are ways to solve this.

Take a snapshot of tables

You can take snapshots of item table for each day.

Example of snapshot data

id,name,status,snapshot_date
1,itemA,ACTIVE,2024-09-13
1,itemA,ACTIVE,2024-09-14
1,itemA,PASSIVE,2024-09-15

New job

SELECT *
FROM item_events
WHERE item_events.item_id in (
    SELECT id
    FROM item_snapshot
    WHERE item.status='ACTIVE' AND snapshot_date={DATE}
) AND item_events.date = {DATE}

As you see, for that old date, item’s status will still be ACTIVE. Thus, you can be sure that on the job rerun it will give same results.

Introduce new values to make job reproducible

You can introduce new fields that will be valid throughout interval: active_start_date and active_end_date.

SELECT *
FROM item_events
WHERE item_events.item_id in (
    SELECT id
    FROM item
    WHERE item.status='ACTIVE' OR (item.active_start_date<={DATE} AND item.active_end_date>={DATE})
) AND item_events.date = {DATE}

On the rerun, it will still give correct results because we did stabilize it by introducing dates where it should be taken into account.

Of course, you can develop custom scripts to backfill data. Downside of this is that your script can contain errors too. However, approaches above will allow to rerun job by only clicking to relevant days from your workflow management tool.

updated_at 15-09-2024