kbc llm export: Data lineage graph missing extractors, writers, and implicit SQL references #2523

@frantisekrehor

Problem

The kbc llm export command generates a data lineage graph (indices/graph.jsonl) that only contains table: and transform: node types. Non-transformation components — extractors, writers, and applications — are completely absent from the graph, even though they are the actual data sources and sinks of the pipeline.

Current behavior

  • All nodes in the graph are either table:* or transform:*
  • Missing from graph: extractors, writers, applications — zero representation
  • Edge types: only consumed_by and produces, and only between tables and transformations; no edge connects an extractor to the tables it loads, or a writer or application to the tables it reads
  • components/index.json correctly catalogs all components, but the lineage graph ignores everything except transformations
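
The gap can be checked with a short script. The following is a minimal sketch, assuming each line of indices/graph.jsonl is a JSON object with source, target, and type fields (the shape used in the expected-behavior examples of this issue) and that the node type is the text before the first colon. The sample node IDs are made up for illustration.

```python
import json
from collections import Counter


def node_type_counts(lines):
    """Count node-type prefixes (text before the first ':') over all edge endpoints."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        edge = json.loads(line)
        for key in ("source", "target"):
            counts[edge[key].split(":", 1)[0]] += 1
    return counts


# Hypothetical sample standing in for indices/graph.jsonl.
sample = [
    '{"source":"table:bucket/orders","target":"transform:example-transformation","type":"consumed_by"}',
    '{"source":"transform:example-transformation","target":"table:bucket/report","type":"produces"}',
]

print(dict(node_type_counts(sample)))
```

Against a real export, passing `open("indices/graph.jsonl")` instead of `sample` should show only the table and transform keys if the behavior described above holds.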

Impact on AI agents

The primary consumer of kbc llm export output is AI agents/LLMs. Without extractor/writer edges, an agent cannot:

  1. Trace data origin: "Where does this table come from?" → No answer from lineage (the table may be extracted from an external DB, but the graph doesn't show this)
  2. Trace data destination: "Where does this reporting table go?" → No answer (the table may feed a writer or a PowerBI refresh app, but the graph doesn't show this)
  3. Assess blast radius: "If I change this extractor config, what transformations are affected?" → Requires manually cross-referencing bucket names with component configs
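
To illustrate the third point, here is a rough sketch of the traversal an agent could run if extractor edges were present. It assumes every edge points in the direction of data flow (source to target), as in the examples proposed in this issue. All node IDs below are hypothetical.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from `start` by following source -> target edges."""
    adjacency = defaultdict(list)
    for edge in edges:
        adjacency[edge["source"]].append(edge["target"])

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


# Hypothetical graph: the first edge is the kind the current export does not emit.
edges = [
    {"source": "extractor:ex-db:1", "target": "table:in/orders", "type": "produces"},
    {"source": "table:in/orders", "target": "transform:t1", "type": "consumed_by"},
    {"source": "transform:t1", "target": "table:out/report", "type": "produces"},
    {"source": "table:out/report", "target": "writer:wr:9", "type": "consumed_by"},
    {"source": "table:in/other", "target": "transform:t2", "type": "consumed_by"},
]

affected = downstream(edges, "extractor:ex-db:1")
print(sorted(n for n in affected if n.startswith("transform:")))
```

Without the first edge, the traversal has no starting point, which is why the question currently requires manual cross-referencing.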

Expected behavior

The lineage graph should include edges for all component types:

{"source":"extractor:component-id:config-id","target":"table:bucket/table-name","type":"produces"}
{"source":"table:bucket/table-name","target":"writer:component-id:config-id","type":"consumed_by"}
{"source":"table:bucket/table-name","target":"application:component-id:config-id","type":"consumed_by"}

This would make the lineage graph a true end-to-end representation of the data pipeline.
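
As a sketch of the requested output only (not of how the CLI is implemented), the helper below builds edges in the format above from a component's kind, IDs, and the tables it reads or writes. The function name and its parameters are hypothetical. How the export would determine which tables a component touches is left open here.

```python
import json


def component_edges(kind, component_id, config_id, reads=(), writes=()):
    """Build lineage edges for one component config (hypothetical helper).

    kind   -- node-type prefix, e.g. "extractor", "writer", "application"
    reads  -- table IDs the component consumes, e.g. "bucket/table-name"
    writes -- table IDs the component produces
    """
    node = f"{kind}:{component_id}:{config_id}"
    edges = [
        {"source": f"table:{table}", "target": node, "type": "consumed_by"}
        for table in reads
    ]
    edges += [
        {"source": node, "target": f"table:{table}", "type": "produces"}
        for table in writes
    ]
    return edges


for edge in component_edges("extractor", "component-id", "config-id",
                            writes=["bucket/table-name"]):
    # Compact separators match the JSONL style used in the examples above.
    print(json.dumps(edge, separators=(",", ":")))
```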

Additional request: Implicit SQL table references

A secondary (but related) gap: the lineage graph is built solely from explicit input/output mappings declared in transformation configs. However, Snowflake transformations can reference tables by fully-qualified name directly in SQL code (e.g., SELECT * FROM "bucket"."table") without declaring them in the input mapping. These implicit dependencies are invisible to the current lineage graph.

Ideally, the export could optionally perform a lightweight static analysis of SQL code blocks to detect FROM/JOIN clauses referencing fully-qualified table names that are not present in the declared input mapping, and add these as a separate edge type (e.g., type: "implicit_ref").
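
A minimal sketch of what such an analysis might look like, using a regular expression rather than a SQL parser. Assumptions: references are written as double-quoted two-part names ("bucket"."table"), table IDs take the bucket/table-name form used above, and the implicit_ref edge points from the table to the transformation. The function name and sample IDs are hypothetical.

```python
import re

# FROM or JOIN followed by a double-quoted two-part name: "bucket"."table"
QUALIFIED_REF = re.compile(r'\b(?:FROM|JOIN)\s+"([^"]+)"\."([^"]+)"', re.IGNORECASE)


def implicit_refs(sql, transform_id, declared_inputs):
    """Return implicit_ref edges for qualified tables missing from the input mapping."""
    edges = []
    seen = set()
    for bucket, table in QUALIFIED_REF.findall(sql):
        table_id = f"{bucket}/{table}"
        if table_id in declared_inputs or table_id in seen:
            continue  # already declared, or already reported once
        seen.add(table_id)
        edges.append(
            {"source": f"table:{table_id}", "target": transform_id, "type": "implicit_ref"}
        )
    return edges


sql = '''
SELECT o.id, c.name
FROM "in.c-main"."orders" o
JOIN "in.c-crm"."customers" c ON c.id = o.customer_id
'''

# Only "orders" is declared in the (hypothetical) input mapping.
found = implicit_refs(sql, "transform:t1", {"in.c-main/orders"})
print(found)
```

A pattern this simple will also match text inside comments and string literals, and it does not handle unquoted or three-part names, so a real implementation would likely need a proper SQL tokenizer. Keeping the result under a separate edge type lets consumers treat it as lower-confidence than declared mappings.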

Environment

  • KBC CLI version: v2.44.0
