Skip to content

[Bug] Grafting fails with "Unexpected null for non-null column" when source has tables with very few rows #6221

@0x19dG87

Description

@0x19dG87

Bug report

Summary

Grafting fails during initialization with error "Unexpected null for non-null column" when the source subgraph contains tables with very few rows (typically < 10 rows). This occurs because PostgreSQL does not generate histogram_bounds statistics for tables with insufficient data, but the graph-node code assumes histogram_bounds is never NULL.

Steps to Reproduce

  1. Start syncing subgraph that relies on graft that has some entities with very few instances (e.g., 1-3 rows). For instance, QmPGNM9A34PYrCpiEjcuQq6TJuB9PvNcqGU1hKa7oWERpZ and its graft QmVXHLm8YBTxabrA4LG3LVnXxpB8mtVJa9KdAYrCtYE6cF.
  2. Observe the error during grafting: Unexpected null for non-null column

Expected Behavior

Grafting should succeed regardless of the number of rows in source tables. PostgreSQL legitimately returns NULL for histogram_bounds when tables have insufficient data for statistical analysis, and the code should handle this gracefully.

Actual Behavior

Grafting fails immediately after obtaining the copy lock.

Root Cause Analysis

The Bug Location

The bug is in store/postgres/src/catalog.rs in the histogram_bounds() function:

#[derive(Queryable, QueryableByName)]
struct Bounds {
    #[diesel(sql_type = Array<BigInt>)]
    bounds: Vec<i64>,  // ❌ Non-nullable
}

sql_query(QUERY)
    .bind::<Text, _>(namespace.as_str())
    .bind::<Text, _>(table.as_str())
    .bind::<Text, _>(column)
    .get_result::<Bounds>(conn)
    .optional()
    .map(|bounds| bounds.map(|b| b.bounds).unwrap_or_default())
    .map_err(StoreError::from)

Why It Fails

  1. PostgreSQL Behavior: PostgreSQL's pg_stats.histogram_bounds is NULL for tables with very few rows (typically < 100 rows). This is documented behavior - histograms are only created when there's enough data to make them statistically meaningful.

  2. Code Assumption: The Bounds struct defines bounds: Vec<i64> (non-nullable), but PostgreSQL can legitimately return NULL.

  3. Diesel Deserialization: When Diesel tries to deserialize a NULL histogram_bounds value into the non-nullable Vec<i64>, it fails with "Unexpected null for non-null column".

  4. .optional() Limitation: The .optional() call only handles missing rows (when there's no pg_stats entry), not NULL column values within existing rows.

When This Occurs

This bug is triggered when:

  • The source subgraph has tables with < ~100 rows
  • These tables exist in both source and destination schemas (so they will be copied)
  • The table is marked for copying during grafting initialization

Example Case

In my investigation:

  • Table sgd129.initialized had exactly 1 row
  • PostgreSQL correctly returned NULL for histogram_bounds:
    SELECT histogram_bounds FROM pg_stats
    WHERE schemaname = 'sgd129' AND tablename = 'initialized' AND attname = 'vid';
    -- Result: NULL
  • Graph-node code failed during VidBatcher::load()catalog::histogram_bounds()

The Fix

Change the bounds field to be nullable and handle the None case:

#[derive(Queryable, QueryableByName)]
struct Bounds {
    #[diesel(sql_type = Nullable<Array<BigInt>>)]
    bounds: Option<Vec<i64>>,  // ✅ Nullable
}

Workaround

Until this is fixed, I was able to work around the issue by temporarily adding dummy data to the table. Grafting will not copy them, because dummy rows are added beyond graft point.

-- Add dummy rows with block$ AFTER the graft point (so they won't be copied)
INSERT INTO sgd129.small_table (block$, id, ...)
SELECT
    (graft_block + 1000000) as block$,
    -- ... other required fields
FROM generate_series(1, 99) s;

-- Analyze to generate histogram
ANALYZE sgd129.small_table;

-- After grafting succeeds, remove dummy data
DELETE FROM sgd129.small_table WHERE block$ > graft_block;

Environment

  • Graph Node version: 0.41.1
  • PostgreSQL version: 18.1
  • Deployment method: Docker

Relevant log output

Dec 03 11:06:39.081 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd146, runner_index: 35, sgd: 146, subgraph_id: QmPGNM9A34PYrCpiEjcuQq6TJuB9PvNcqGU1hKa7oWERpZ, component: SubgraphInstanceManager
Dec 03 11:06:39.281 ERRO Failed to start subgraph, code: SubgraphStartFailure, error: store error: Unexpected null for non-null column, runner_index: 35, sgd: 146, subgraph_id: QmPGNM9A34PYrCpiEjcuQq6TJuB9PvNcqGU1hKa7oWERpZ, component: SubgraphInstanceManager

IPFS hash

QmPGNM9A34PYrCpiEjcuQq6TJuB9PvNcqGU1hKa7oWERpZ

Subgraph name or link to explorer

https://thegraph.com/explorer/subgraphs/35Ni9RTU1e2ZY4VvQwANmhpst8NUXUyKmMnv7DnaLxtq?view=Query&chain=arbitrum-one

Some information to help us out

  • Tick this box if this bug is caused by a regression found in the latest release.
  • Tick this box if this bug is specific to the hosted service.
  • I have searched the issue tracker to make sure this issue is not a duplicate.

OS information

None

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions