Context
In discussion on metadata PR #192, we realized that the LinkML schema (isamples_core.yaml) and PQG parquet representations serve different purposes:
| Serialization | Purpose | How Relationships Work |
| --- | --- | --- |
| JSON/JSON-LD (LinkML) | Document exchange | Nesting provides implicit linkage |
| PQG Narrow (parquet) | Graph queries | Explicit edge rows with s/p/o |
| PQG Wide (parquet) | Analytical queries | `p__*` columns with row_id arrays |
The LinkML schema is the conceptual model and JSON serialization spec. But PQG parquet has additional structural requirements that aren't formally documented:
- Every entity row has `row_id` (internal identifier)
- Edge rows have `s`, `p`, `o`, `n` columns
- Wide format has `p__*` columns containing arrays of target row_ids
- All nodes (including GeospatialCoordLocation, SampleRelation) get PIDs in parquet even though PIDs are optional in the JSON schema
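To make the narrow/wide relationship concrete, here is a minimal sketch in plain Python. The column names (`row_id`, `otype`, `s`/`p`/`o`/`n`, `p__*`) follow the conventions listed above; the sample entities, the `sample_location` predicate, and the PID values are invented for illustration.

```python
# Narrow format: entity rows plus explicit edge rows (shown here as dicts,
# one dict per parquet row). Sample data is hypothetical.
entities = [
    {"row_id": 0, "otype": "MaterialSampleRecord", "pid": "igsn:ABC123"},
    {"row_id": 1, "otype": "GeospatialCoordLocation", "pid": "local:loc-1"},
]
edges = [
    # s/p/o reference row_ids; n is an optional edge qualifier (unused here).
    {"s": 0, "p": "sample_location", "o": 1, "n": None},
]

# Wide format: fold each predicate into a p__<predicate> column holding an
# array of target row_ids on the subject's row.
wide = {e["row_id"]: dict(e) for e in entities}
for edge in edges:
    col = f"p__{edge['p']}"
    wide[edge["s"]].setdefault(col, []).append(edge["o"])

print(wide[0]["p__sample_location"])  # -> [1]
```

The key point the spec would need to capture is that the wide format is derivable from the narrow one: every `p__*` column is just the edge table grouped by subject and predicate.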
Question
Should we create a formal PQG Parquet Schema specification that documents these parquet-specific conventions?
Preliminary Research: Parquet Schema Standards
What exists
- **Parquet's built-in schema** - files are self-describing (column names and types are stored in the footer), but this only describes structure, not semantics.
- **Frictionless Data Package** - the most mature standard for dataset metadata:
  - Table Schema: columns with types, constraints, descriptions
  - Parquet support is being added in v2
  - Tension: parquet already has an embedded schema (potential duplication)
- **Data catalogs** - AWS Glue, Apache Iceberg, and Delta Lake have their own metadata layers, but these are platform-specific.
- **No dominant standard for "parquet + semantics"** - most approaches are ad hoc (README files, companion JSON/YAML).
Possible approach for PQG
| Layer | Format | What It Describes |
| --- | --- | --- |
| Conceptual | LinkML YAML | Entity types, properties, relationships |
| JSON serialization | JSON Schema (from LinkML) | Document structure |
| Parquet serialization | Custom spec doc (+ Frictionless Table Schema?) | Column meanings, PQG conventions |
Proposal
Create a `PQG_SPECIFICATION.md` document in this repo that:
- References the LinkML conceptual model (don't duplicate it)
- Documents PQG-specific conventions:
  - Narrow format: `row_id`, `otype`, `s`/`p`/`o`/`n` for edges
  - Wide format: `p__*` columns and their relationship to the narrow format
  - Why all entities get PIDs in parquet (graph traversal) even if optional in JSON
- Optionally includes a Frictionless Table Schema for machine-readable column definitions
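If we do include a Frictionless Table Schema, a machine-readable definition for the narrow-format columns might look something like the sketch below. Field names follow the conventions discussed in this issue; the types, descriptions, and constraints are illustrative assumptions, not a settled design.

```json
{
  "fields": [
    {"name": "row_id", "type": "integer",
     "description": "Internal row identifier",
     "constraints": {"required": true, "unique": true}},
    {"name": "otype", "type": "string",
     "description": "Entity type from the LinkML conceptual model"},
    {"name": "pid", "type": "string",
     "description": "Persistent identifier (always populated in parquet)"},
    {"name": "s", "type": "integer", "description": "Edge subject (row_id)"},
    {"name": "p", "type": "string", "description": "Edge predicate"},
    {"name": "o", "type": "integer", "description": "Edge object (row_id)"},
    {"name": "n", "type": "string", "description": "Optional edge qualifier"}
  ]
}
```

One open question this raises: entity rows and edge rows have disjoint column sets, so we would need to decide whether they share one Table Schema (with most fields nullable) or get separate schemas.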
Questions for Discussion
- Is a formal spec worth the maintenance overhead?
- Should we use Frictionless Table Schema, or just markdown documentation?
- What level of detail is needed? (Column definitions only, or also query patterns?)
cc @smrgeoinfo @datadavev