BigQuery Optimization: From Full Reload to Incremental Processing
Csaba Kassai · CTO at Rabbit
11 min read
This post walks through a real optimization pattern applied for a Rabbit client: rewriting scheduled queries from full reloads to incremental MERGE statements. I’ll demonstrate the technique step by step using Google Analytics 4 data (a perfect candidate because event data finalizes within a day and doesn’t change after that), covering how to partition your tables, define a self-healing lookback window, and structure a MERGE query that handles both initial loads and daily incremental runs.
One of the fastest ways to cut BigQuery costs is also one of the simplest: stop reprocessing data that hasn’t changed. Many teams run scheduled queries that drop and recreate entire tables every time they execute — even when 99% of the underlying data is identical to the last run. For a table with a year of history, that means paying to scan 365 days of data just to update today’s rows.
Why do full reloads keep getting more expensive?
A full reload query follows a simple pattern: truncate the target table and rebuild it from scratch using the entire source dataset. It’s easy to write, easy to reason about, and it guarantees consistency.
The problem is cost. BigQuery charges by bytes scanned, and a full reload scans everything, every time. When the source table is small, nobody notices. But source tables grow. A production Google Analytics property generating 330 GB of events per month has a cumulative source table of 330 GB in January — but that table grows to 4 TB by December as each month’s data piles up. The scheduled query that cost $2/day in January costs $24/day by year-end — for the exact same logic.
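The growth math is worth making explicit. Here is a quick Python sketch of that cost trajectory (the ~330 GB/month growth figure and the $6.25/TiB on-demand rate are the assumptions from the text; substitute your own numbers):

```python
# Daily cost of a full-reload scheduled query as the source table grows.
# Assumptions: ~330 GB of new events per month, on-demand pricing at $6.25/TiB.
PRICE_PER_TIB = 6.25
MONTHLY_GROWTH_GB = 330

def full_reload_cost_per_day(months_of_history: int) -> float:
    """Cost of one full scan over the accumulated source table, in USD."""
    table_size_tib = months_of_history * MONTHLY_GROWTH_GB / 1024
    return table_size_tib * PRICE_PER_TIB

print(round(full_reload_cost_per_day(1), 2))   # January: ~$2/day
print(round(full_reload_cost_per_day(12), 2))  # December: ~$24/day
```

The logic never changes; only the accumulated history does, and the full reload pays for all of it on every run.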
At a Rabbit client, we identified scheduled queries following the full reload pattern and rewrote them to incremental processing. The result: over 95% reduction in scanned bytes and query costs for the affected queries.
The example: a session summary table from GA4 events
Let’s make this concrete. We’ll use the public GA4 sample dataset (`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`) to build a daily session summary table — total events, page views, and purchases per session per day.
This is a typical analytics pipeline pattern: raw event data flows into a summary table that dashboards and reports query downstream.
The full reload version of the query
Here’s the straightforward approach — rebuild the entire table every run:
```sql
CREATE OR REPLACE TABLE `my_project.analytics.session_summary`
PARTITION BY event_date
AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  user_pseudo_id,
  (SELECT value.int_value
   FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS session_id,
  COUNT(*) AS total_events,
  COUNTIF(event_name = 'page_view') AS page_views,
  COUNTIF(event_name = 'purchase') AS purchases
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
GROUP BY
  event_date, user_pseudo_id, session_id;
```
This query scans the full source table on every execution. On the public sample dataset (~4.3 million rows across ~3 months), that’s manageable. But imagine the same pattern against a production GA4 property with a year or more of history. Every scheduled run pays the full price.
How to convert a full reload query to incremental processing?
Step 1: partition both source and target
The foundation of incremental processing in BigQuery is partitioning. If your target table isn’t partitioned, you can’t efficiently update a subset of it. If the source isn’t partitioned, you can’t efficiently read a subset either.
The GA4 export tables are already date-sharded — each day is a separate table with an events_YYYYMMDD suffix. This isn’t the same as BigQuery’s native partitioning (which stores segments within a single table), but filtering with _TABLE_SUFFIX achieves a similar pruning effect: BigQuery only reads the tables that match. Our target table above is partitioned by event_date. That means BigQuery can prune both reads and writes to only the date range we care about.
If your source table isn’t partitioned by date, this is the first thing to fix. Without partition pruning on the source side, an incremental query still scans everything.
Step 2: define the incremental window
Next, decide how many days to reprocess. This step rests on a crucial prerequisite: source data may only change within a bounded time window during normal operation. Records older than that window must be effectively final. If past data can be modified indefinitely, there is no safe cutoff and the full dataset must always be scanned, which defeats the purpose. The lookback window you choose must fully cover this change period.
For GA4, intraday tables (`events_intraday_YYYYMMDD`) update throughout the day and finalize overnight. A safe window is 3 days, covering:
- Today’s intraday data (still updating)
- Yesterday’s data (freshly finalized)
- One extra day as a safety buffer
A naïve approach would hardcode the window relative to today (`DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`), but that breaks if the pipeline ever falls behind: a weekend outage means Monday’s run only covers the last 3 days and misses Friday. Instead, we derive the window from the target table itself:
```sql
DECLARE lookback_start DATE;
DECLARE lookback_end DATE DEFAULT CURRENT_DATE();

SET lookback_start = (
  SELECT IFNULL(DATE_SUB(MAX(event_date), INTERVAL 3 DAY), DATE('2000-01-01'))
  FROM `my_project.analytics.session_summary`
);
```
This makes the query self-healing: if the target table is empty (first run or after a reset), MAX(event_date) returns NULL and the IFNULL fallback scans the entire source — performing a full initial load automatically. On subsequent runs, it picks up from 3 days before the latest data in the target, regardless of when the last run actually happened. A multi-day outage just means the next run’s window is wider, not that data gets skipped.
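The window derivation is simple enough to model outside SQL. This illustrative Python sketch (the function and constant names are mine, not part of the pipeline) mirrors the `IFNULL` fallback:

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative model of the self-healing lookback window, mirroring
# IFNULL(DATE_SUB(MAX(event_date), INTERVAL 3 DAY), DATE('2000-01-01')).
FALLBACK_START = date(2000, 1, 1)  # far in the past: forces a full initial load

def lookback_start(max_target_date: Optional[date], lookback_days: int = 3) -> date:
    """Start of the reprocessing window, derived from the target's newest row."""
    if max_target_date is None:  # empty target table: scan everything
        return FALLBACK_START
    return max_target_date - timedelta(days=lookback_days)

# Empty table -> full load; populated table -> resume 3 days back from the
# latest loaded data, no matter how long ago the last run actually happened.
print(lookback_start(None))               # 2000-01-01
print(lookback_start(date(2024, 6, 10)))  # 2024-06-07
```

Because the window is anchored to the target’s own state rather than the wall clock, a stalled pipeline simply produces a wider window on its next run.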
Step 3: rewrite as MERGE
Now we replace the full CREATE OR REPLACE with a MERGE statement that uses the dynamic window to touch only the data that needs updating:
```sql
DECLARE lookback_start DATE;
DECLARE lookback_end DATE DEFAULT CURRENT_DATE();

SET lookback_start = (
  SELECT IFNULL(DATE_SUB(MAX(event_date), INTERVAL 3 DAY), DATE('2000-01-01'))
  FROM `my_project.analytics.session_summary`
);

MERGE `my_project.analytics.session_summary` AS target
USING (
  SELECT
    PARSE_DATE('%Y%m%d', event_date) AS event_date,
    user_pseudo_id,
    (SELECT value.int_value
     FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS session_id,
    COUNT(*) AS total_events,
    COUNTIF(event_name = 'page_view') AS page_views,
    COUNTIF(event_name = 'purchase') AS purchases
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  WHERE
    _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', lookback_start)
                      AND FORMAT_DATE('%Y%m%d', lookback_end)
  GROUP BY
    event_date, user_pseudo_id, session_id
) AS source
ON target.event_date = source.event_date
  AND target.user_pseudo_id = source.user_pseudo_id
  AND target.session_id = source.session_id
WHEN MATCHED THEN
  UPDATE SET
    total_events = source.total_events,
    page_views = source.page_views,
    purchases = source.purchases
WHEN NOT MATCHED BY TARGET THEN
  INSERT (event_date, user_pseudo_id, session_id, total_events, page_views, purchases)
  VALUES (source.event_date, source.user_pseudo_id, source.session_id,
          source.total_events, source.page_views, source.purchases)
WHEN NOT MATCHED BY SOURCE
  AND target.event_date BETWEEN lookback_start AND lookback_end THEN
  DELETE;
```
Let’s break down what changed:
- **Self-healing window:** The `SET lookback_start` query reads the target table to determine where to resume. On an empty table this scans zero bytes; on a populated table it reads only partition metadata. The result is a single query that handles both the initial load and daily incremental runs.
- **Source pruning:** The `WHERE _TABLE_SUFFIX BETWEEN ...` clause tells BigQuery to read only the sharded tables within the lookback window. This is where the cost savings come from: on daily runs, we scan only a few days’ worth of source data instead of the full history.
- **Full upsert + delete logic:** `WHEN MATCHED THEN UPDATE` handles re-processing existing days (yesterday’s data might have been finalized since the last run). `WHEN NOT MATCHED BY TARGET THEN INSERT` adds new rows for fresh data. `WHEN NOT MATCHED BY SOURCE THEN DELETE` removes target rows within the lookback window that no longer exist in the source; the `event_date BETWEEN` guard ensures rows outside the window are left untouched.
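If the three branches are hard to visualize, here is a toy in-memory model of the MERGE semantics in Python (dicts standing in for tables; purely illustrative, not how BigQuery executes a MERGE):

```python
# Toy model of the MERGE: target and source are dicts keyed by
# (event_date, user_pseudo_id, session_id); values are metric tuples.
def merge(target: dict, source: dict, window: set) -> dict:
    result = dict(target)
    for key, metrics in source.items():
        result[key] = metrics  # MATCHED -> UPDATE; NOT MATCHED BY TARGET -> INSERT
    for key in list(result):
        event_date = key[0]
        if key not in source and event_date in window:
            del result[key]    # NOT MATCHED BY SOURCE, inside the window -> DELETE
    return result

target = {("d1", "u1", 1): (5, 2, 0),   # old day, outside the window: untouched
          ("d3", "u2", 2): (9, 9, 9)}   # stale row inside the reprocessed window
source = {("d3", "u3", 3): (4, 1, 1)}   # fresh data for the reprocessed day
print(merge(target, source, window={"d3"}))
# {('d1', 'u1', 1): (5, 2, 0), ('d3', 'u3', 3): (4, 1, 1)}
```

Note how the stale row inside the window is deleted while the row outside the window survives untouched; that is exactly what the `event_date BETWEEN` guard buys you.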
What this saves in practice
We ran the self-healing MERGE against the public GA4 sample dataset (92 days of data, ~4.3 million events) to get real numbers. The same query was executed twice: first against an empty target table (initial load), then immediately again (subsequent daily run).
| | Full reload (CREATE OR REPLACE) | MERGE — initial load | MERGE — subsequent run |
|---|---|---|---|
| Bytes processed | 990 MB | 990 MB | 55 MB |
| Bytes billed | 991 MB | 991 MB | 62 MB |
| Slot time | 49 s | 53 s | 23 s |
| Wall clock time | ~6 s | ~10 s | ~5 s |
The initial load costs the same as a full reload — there’s no free lunch when the target table is empty and all source data must be scanned. But that cost is paid once. Every subsequent run processes 94% less data, and this gap only widens as the dataset grows:
| Dataset size | Full reload scans | Incremental (3-day) scans | Reduction |
|---|---|---|---|
| 3 months | ~90 days | 3 days | ~97% |
| 6 months | ~180 days | 3 days | ~98% |
| 1 year | ~365 days | 3 days | ~99% |
To put this in dollar terms (BigQuery on-demand pricing at $6.25/TiB — check current pricing for your region), consider a 100 GB table queried once daily by a scheduled query:
- Full reload: 100 GB × $6.25/TiB = ~$0.61/day → ~$18/month
- Incremental (3-day window on a 1-year table): ~0.82 GB × $6.25/TiB = ~$0.005/day → ~$0.15/month
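The back-of-the-envelope math above can be reproduced directly (illustrative only; plug in your own table size and your region’s current on-demand rate):

```python
PRICE_PER_TIB = 6.25  # on-demand rate assumed in the text; check your region

def daily_scan_cost(scanned_gb: float) -> float:
    """USD cost of scanning the given volume once, at on-demand pricing."""
    return scanned_gb / 1024 * PRICE_PER_TIB

table_gb = 100
incremental_gb = table_gb * 3 / 365  # 3-day window on a 1-year table: ~0.82 GB

print(round(daily_scan_cost(table_gb), 2))        # full reload: ~$0.61/day
print(round(daily_scan_cost(incremental_gb), 3))  # incremental: ~$0.005/day
print(round(daily_scan_cost(table_gb) * 30))      # full reload: ~$18/month
```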
These numbers compound across every scheduled query in your project. At a Rabbit client, we converted multiple scheduled queries from full reloads to incremental processing and achieved over 95% cost reduction on the affected queries.
Does incremental processing also cut your BigQuery storage bill?
There’s a second cost benefit that’s easy to overlook. BigQuery offers two storage pricing tiers: active storage for data modified in the last 90 days, and long-term storage for data untouched for 90+ days, which is roughly half the price.
When you run a full reload with CREATE OR REPLACE, every partition in the target table is rewritten on every run. That resets the modification timestamp for the entire table, so nothing ever qualifies for long-term storage pricing. You’re permanently stuck paying the active storage rate for your entire dataset.
With incremental processing, only the lookback-window partitions get touched. Older partitions remain unmodified and automatically transition to long-term storage after 90 days. For a table with a year of history where you only update the last 3 days, roughly 75% of your partitions will be in long-term storage at any given time: their storage cost is halved, which cuts the table’s total storage bill by more than a third compared to the full reload approach.
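A rough model of the storage effect, assuming BigQuery’s logical storage rates at the time of writing ($0.02/GiB-month active, $0.01/GiB-month long-term; verify current pricing for your region):

```python
# Rough model: with a 3-day lookback, a partition stops being modified once it
# falls out of the window, then qualifies for long-term pricing 90 days later.
ACTIVE_RATE, LONG_TERM_RATE = 0.02, 0.01  # $/GiB-month, logical storage (assumed)

def storage_cost(history_days: int, lookback_days: int, gib: float,
                 incremental: bool) -> float:
    """Monthly storage bill in USD for a date-partitioned table."""
    if not incremental:
        return gib * ACTIVE_RATE  # full reload rewrites every partition: all active
    long_term_days = max(history_days - lookback_days - 90, 0)
    long_term_gib = gib * long_term_days / history_days
    return long_term_gib * LONG_TERM_RATE + (gib - long_term_gib) * ACTIVE_RATE

full = storage_cost(365, 3, 1000, incremental=False)  # $20.00/month
inc = storage_cost(365, 3, 1000, incremental=True)
print(round(inc, 2), round(1 - inc / full, 2))        # ~$12.55/month, ~37% cheaper
```

For a 1 TiB table with a year of history, the model puts about 75% of the data in long-term storage, a saving of roughly 37% on storage alone.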
Dive deeper into optimizing BigQuery storage:
How to Choose the Right BigQuery Storage Pricing Model (Logical vs Physical) to Save up to 60%
When incremental processing won’t work
This technique relies on one key assumption: source data stabilizes within a known, bounded time window. GA4 event exports are a perfect fit: intraday data updates throughout the day, but once a day’s data is finalized (typically overnight), it’s immutable. A 3-day lookback window comfortably covers this change period.
Be careful with sources where past data can be modified indefinitely or unpredictably — for example, CRM tables where deal stages change weeks after creation, or financial data subject to late reconciliation adjustments. If there’s no reliable upper bound on how far back changes can reach, there’s no safe lookback window, and incremental processing won’t give you correct results.
How does Rabbit help you find the right queries to optimize?
The optimization described above is effective but manual. You need to audit each scheduled query, determine the right lookback window, rewrite the SQL, test it, and deploy it. Across a project with dozens of scheduled queries, that’s a significant effort — and the hardest part is knowing where to start.
This is where Rabbit helps. Rabbit analyzes your BigQuery workloads and surfaces the most expensive scheduled queries, so you can focus your optimization effort where it matters most. Instead of auditing every query yourself, you get a prioritized view of which queries scan the most data and cost the most — making it straightforward to identify full-reload candidates and apply the incremental pattern where it will have the biggest impact.
Sign up for free to see which of your scheduled queries could benefit from incremental processing — or book a demo to walk through your specific setup with our team.

