Context
In discussion on metadata PR #192, we realized that the LinkML schema (isamples_core.yaml) and PQG parquet representations serve different purposes:
| Serialization | Purpose | How Relationships Work |
| --- | --- | --- |
| JSON/JSON-LD (LinkML) | Document exchange | Nesting provides implicit linkage |
| PQG Narrow (parquet) | Graph queries | Explicit edge rows with s/p/o |
| PQG Wide (parquet) | Analytical queries | `p__*` columns with row_id arrays |
The LinkML schema is the conceptual model and JSON serialization spec. But PQG parquet has additional structural requirements that aren't formally documented:
- Every entity row has `row_id` (internal identifier)
- Edge rows have `s`, `p`, `o`, `n` columns
- Wide format has `p__*` columns containing arrays of target row_ids
- All nodes (including GeospatialCoordLocation, SampleRelation) get PIDs in parquet even though PIDs are optional in the JSON schema
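To make the narrow/wide relationship concrete, here is a minimal sketch in plain Python. The column names (`row_id`, `otype`, `s`/`p`/`o`/`n`, `p__*`) follow the conventions listed above; the sample entities, the `sample_location` predicate, and the PID values are invented for illustration.

```python
# Narrow format: entity rows plus explicit edge rows (shown here as dicts,
# one dict per parquet row). Sample data is hypothetical.
entities = [
    {"row_id": 0, "otype": "MaterialSampleRecord", "pid": "igsn:ABC123"},
    {"row_id": 1, "otype": "GeospatialCoordLocation", "pid": "local:loc-1"},
]
edges = [
    # s/p/o reference row_ids; n is an optional edge qualifier (unused here).
    {"s": 0, "p": "sample_location", "o": 1, "n": None},
]

# Wide format: fold each predicate into a p__<predicate> column holding an
# array of target row_ids on the subject's row.
wide = {e["row_id"]: dict(e) for e in entities}
for edge in edges:
    col = f"p__{edge['p']}"
    wide[edge["s"]].setdefault(col, []).append(edge["o"])

print(wide[0]["p__sample_location"])  # -> [1]
```

The key point the spec would need to capture is that the wide format is derivable from the narrow one: every `p__*` column is just the edge table grouped by subject and predicate.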
Question
Should we create a formal PQG Parquet Schema specification that documents these parquet-specific conventions?
Preliminary Research: Parquet Schema Standards
What exists
- **Parquet's built-in schema** - files are self-describing (column names and types are stored in the footer), but this only describes structure, not semantics.
- **Frictionless Data Package** - the most mature standard for dataset metadata:
  - Table Schema: columns with types, constraints, descriptions
  - Parquet support is being added in v2
  - Tension: parquet already has an embedded schema (potential duplication)
- **Data catalogs** - AWS Glue, Apache Iceberg, and Delta Lake have their own metadata layers, but these are platform-specific.
- **No dominant standard for "parquet + semantics"** - most approaches are ad hoc (README files, companion JSON/YAML).
Possible approach for PQG
| Layer | Format | What It Describes |
| --- | --- | --- |
| Conceptual | LinkML YAML | Entity types, properties, relationships |
| JSON serialization | JSON Schema (from LinkML) | Document structure |
| Parquet serialization | Custom spec doc (+ Frictionless Table Schema?) | Column meanings, PQG conventions |
Proposal
Create a `PQG_SPECIFICATION.md` document in this repo that:
- References the LinkML conceptual model (don't duplicate it)
- Documents PQG-specific conventions:
  - Narrow format: `row_id`, `otype`, `s`/`p`/`o`/`n` for edges
  - Wide format: `p__*` columns and their relationship to the narrow format
  - Why all entities get PIDs in parquet (graph traversal) even if optional in JSON
- Optionally includes a Frictionless Table Schema for machine-readable column definitions
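If we do include a Frictionless Table Schema, a machine-readable definition for the narrow-format columns might look something like the sketch below. Field names follow the conventions discussed in this issue; the types, descriptions, and constraints are illustrative assumptions, not a settled design.

```json
{
  "fields": [
    {"name": "row_id", "type": "integer",
     "description": "Internal row identifier",
     "constraints": {"required": true, "unique": true}},
    {"name": "otype", "type": "string",
     "description": "Entity type from the LinkML conceptual model"},
    {"name": "pid", "type": "string",
     "description": "Persistent identifier (always populated in parquet)"},
    {"name": "s", "type": "integer", "description": "Edge subject (row_id)"},
    {"name": "p", "type": "string", "description": "Edge predicate"},
    {"name": "o", "type": "integer", "description": "Edge object (row_id)"},
    {"name": "n", "type": "string", "description": "Optional edge qualifier"}
  ]
}
```

One open question this raises: entity rows and edge rows have disjoint column sets, so we would need to decide whether they share one Table Schema (with most fields nullable) or get separate schemas.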
Questions for Discussion
- Is a formal spec worth the maintenance overhead?
- Should we use Frictionless Table Schema, or just markdown documentation?
- What level of detail is needed? (Column definitions only, or also query patterns?)
cc @smrgeoinfo @datadavev