diff --git a/src/docs/EDGE_TYPES_VISUAL.md b/src/docs/EDGE_TYPES_VISUAL.md new file mode 100644 index 00000000..c51473c6 --- /dev/null +++ b/src/docs/EDGE_TYPES_VISUAL.md @@ -0,0 +1,625 @@ +# Visual Guide to iSamples Edge Types + +This document provides visual representations of the iSamples property graph structure using diagrams and charts. + +## Table of Contents + +1. [Complete Entity Relationship Diagram](#complete-entity-relationship-diagram) +2. [Edge Type Matrix](#edge-type-matrix) +3. [Sample-Centric View](#sample-centric-view) +4. [Event-Centric View](#event-centric-view) +5. [Graph Traversal Examples](#graph-traversal-examples) +6. [Edge Type Heatmap](#edge-type-heatmap) +7. [Storage Structure Diagram](#storage-structure-diagram) + +--- + +## Complete Entity Relationship Diagram + +This diagram shows all 8 entity types and the 14 relationship types (predicates) connecting them. + +```mermaid +graph TB + MSR[MaterialSampleRecord
📋 Sample] + Event[SamplingEvent
🎯 Collection Event] + Site[SamplingSite
📍 Named Location] + Coords[GeospatialCoordLocation
🌍 Coordinates] + Concept[IdentifiedConcept
🏷️ Vocabulary Term] + Agent[Agent
👀 Person/Organization] + Curation[MaterialSampleCuration
📦 Repository Info] + Relation[SampleRelation
🔗 Sample Links] + + MSR -->|produced_by| Event + MSR -->|has_material_category| Concept + MSR -->|has_context_category| Concept + MSR -->|has_sample_object_type| Concept + MSR -->|keywords| Concept + MSR -->|registrant| Agent + MSR -->|curation| Curation + MSR -->|related_resource| Relation + + Event -->|sampling_site| Site + Event -->|sample_location| Coords + Event -->|has_context_category| Concept + Event -->|responsibility| Agent + + Site -->|site_location| Coords + + Curation -->|responsibility| Agent + + classDef core fill:#e1f5ff,stroke:#0077be,stroke-width:3px + classDef event fill:#fff4e1,stroke:#ff8c00,stroke-width:2px + classDef location fill:#e8f5e9,stroke:#4caf50,stroke-width:2px + classDef vocab fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px + classDef supporting fill:#fce4ec,stroke:#e91e63,stroke-width:2px + + class MSR core + class Event event + class Site,Coords location + class Concept vocab + class Agent,Curation,Relation supporting +``` + +**Legend:** +- **📋 MaterialSampleRecord (blue):** The physical sample - central entity +- **🎯 SamplingEvent (orange):** When/how the sample was collected +- **📍 SamplingSite (green):** Named locations (e.g., "Çatalhöyük") +- **🌍 GeospatialCoordLocation (green):** Latitude/longitude coordinates +- **🏷️ IdentifiedConcept (purple):** Controlled vocabulary terms +- **👀 Agent (pink):** People and organizations +- **📦 MaterialSampleCuration (pink):** Repository/archive information +- **🔗 SampleRelation (pink):** Links between related samples + +--- + +## Edge Type Matrix + +This table shows which entity types (subjects) connect to which entity types (objects) via which predicates. 
+ +| **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** | +|------------------|---------------|-----------------|-----------------|--------------| +| MaterialSampleRecord | `produced_by` | SamplingEvent | No | Yes | +| MaterialSampleRecord | `has_material_category` | IdentifiedConcept | Yes | No | +| MaterialSampleRecord | `has_context_category` | IdentifiedConcept | Yes | No | +| MaterialSampleRecord | `has_sample_object_type` | IdentifiedConcept | Yes | No | +| MaterialSampleRecord | `keywords` | IdentifiedConcept | Yes | No | +| MaterialSampleRecord | `registrant` | Agent | No | No | +| MaterialSampleRecord | `curation` | MaterialSampleCuration | No | No | +| MaterialSampleRecord | `related_resource` | SampleRelation | Yes | No | +| SamplingEvent | `sampling_site` | SamplingSite | No | No | +| SamplingEvent | `sample_location` | GeospatialCoordLocation | No | No | +| SamplingEvent | `has_context_category` | IdentifiedConcept | Yes | No | +| SamplingEvent | `responsibility` | Agent | Yes | No | +| SamplingSite | `site_location` | GeospatialCoordLocation | No | No | +| MaterialSampleCuration | `responsibility` | Agent | Yes | No | + +**Total:** 14 edge types forming the complete iSamples grammar + +--- + +## Sample-Centric View + +This diagram focuses on relationships emanating from a MaterialSampleRecord (the core entity). + +```mermaid +graph LR + Sample[MaterialSampleRecord
'Pottery Sherd 42'] + + Sample -->|produced_by
REQUIRED| Event[SamplingEvent
'Excavation Layer 3'] + Sample -->|has_material_category| Mat[IdentifiedConcept
'Ceramic'] + Sample -->|has_context_category| Ctx[IdentifiedConcept
'Archaeological'] + Sample -->|has_sample_object_type| Type[IdentifiedConcept
'Pottery'] + Sample -->|keywords| KW1[IdentifiedConcept
'Neolithic'] + Sample -->|keywords| KW2[IdentifiedConcept
'Painted'] + Sample -->|registrant| Reg[Agent
'J. Smith'] + Sample -->|curation| Cur[MaterialSampleCuration
'Museum Archive'] + Sample -->|related_resource| Rel[SampleRelation
'Parent Sample Link'] + + classDef sample fill:#e1f5ff,stroke:#0077be,stroke-width:4px + classDef required fill:#ffebee,stroke:#c62828,stroke-width:3px + classDef optional fill:#f5f5f5,stroke:#757575,stroke-width:1px + + class Sample sample + class Event required + class Mat,Ctx,Type,KW1,KW2,Reg,Cur,Rel optional +``` + +**Key observations:** +- **Only `produced_by` is required** - every sample MUST link to a SamplingEvent +- **Multiple keywords** can be assigned (multivalued) +- **IdentifiedConcept is used 4 different ways** - material, context, object type, keywords +- **4 relationship types to IdentifiedConcept** enable rich categorization + +--- + +## Event-Centric View + +This diagram shows how SamplingEvent acts as a bridge between samples and location/collector information. + +```mermaid +graph TB + subgraph Samples + S1[Sample 1] + S2[Sample 2] + S3[Sample 3] + end + + subgraph Event Context + Event[SamplingEvent
'2023-06-15 Excavation'] + end + + subgraph Location + Site[SamplingSite
'Çatalhöyük'] + Coords1[GeospatialCoordLocation
'Event Location'] + Coords2[GeospatialCoordLocation
'Site Centroid'] + end + + subgraph People + Agent1[Agent
'Dr. Smith'] + Agent2[Agent
'Lab Tech'] + end + + subgraph Classification + Context[IdentifiedConcept
'Archaeological'] + end + + S1 -->|produced_by| Event + S2 -->|produced_by| Event + S3 -->|produced_by| Event + + Event -->|sampling_site| Site + Event -->|sample_location| Coords1 + Event -->|responsibility| Agent1 + Event -->|responsibility| Agent2 + Event -->|has_context_category| Context + + Site -->|site_location| Coords2 + + classDef event fill:#fff4e1,stroke:#ff8c00,stroke-width:3px + classDef samples fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + classDef location fill:#e8f5e9,stroke:#4caf50,stroke-width:2px + classDef people fill:#fce4ec,stroke:#e91e63,stroke-width:2px + classDef vocab fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px + + class Event event + class S1,S2,S3 samples + class Site,Coords1,Coords2 location + class Agent1,Agent2 people + class Context vocab +``` + +**Key observations:** +- **Multiple samples** can share the same SamplingEvent (batch collection) +- **Two paths to coordinates:** Event location (specific) vs Site location (general) +- **Multiple agents** can be responsible for an event (multivalued) +- **Event bridges samples to context** - who, when, where + +--- + +## Graph Traversal Examples + +### Example 1: Find Sample Coordinates (2-hop traversal) + +```mermaid +graph LR + A[MaterialSampleRecord] -->|1. produced_by| B[SamplingEvent] + B -->|2. sample_location| C[GeospatialCoordLocation] + + style A fill:#e1f5ff,stroke:#0077be,stroke-width:3px + style B fill:#fff4e1,stroke:#ff8c00,stroke-width:2px + style C fill:#e8f5e9,stroke:#4caf50,stroke-width:2px +``` + +**SQL Pattern:** +```sql +SELECT sample.*, coords.* +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +``` + +### Example 2: Find Sample Site Name and Site Coordinates (3-hop traversal) + +```mermaid +graph LR + A[MaterialSampleRecord] -->|1. 
produced_by| B[SamplingEvent] + B -->|2. sampling_site| C[SamplingSite] + C -->|3. site_location| D[GeospatialCoordLocation] + + style A fill:#e1f5ff,stroke:#0077be,stroke-width:3px + style B fill:#fff4e1,stroke:#ff8c00,stroke-width:2px + style C fill:#e8f5e9,stroke:#4caf50,stroke-width:2px + style D fill:#e8f5e9,stroke:#4caf50,stroke-width:2px +``` + +**SQL Pattern:** +```sql +SELECT sample.*, site.label AS site_name, coords.* +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sampling_site' +JOIN pqg AS site ON site.row_id = ANY(edge2.o) +JOIN pqg AS edge3 ON edge3.s = site.row_id AND edge3.p = 'site_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge3.o) +``` + +### Example 3: Find Sample Collector (2-hop traversal) + +```mermaid +graph LR + A[MaterialSampleRecord] -->|1. produced_by| B[SamplingEvent] + B -->|2. responsibility| C[Agent] + + style A fill:#e1f5ff,stroke:#0077be,stroke-width:3px + style B fill:#fff4e1,stroke:#ff8c00,stroke-width:2px + style C fill:#fce4ec,stroke:#e91e63,stroke-width:2px +``` + +**SQL Pattern:** +```sql +SELECT sample.*, agent.* +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'responsibility' +JOIN pqg AS agent ON agent.row_id = ANY(edge2.o) +``` + +### Example 4: Find Repository Curator (2-hop traversal) + +```mermaid +graph LR + A[MaterialSampleRecord] -->|1. curation| B[MaterialSampleCuration] + B -->|2. 
responsibility| C[Agent] + + style A fill:#e1f5ff,stroke:#0077be,stroke-width:3px + style B fill:#fce4ec,stroke:#e91e63,stroke-width:2px + style C fill:#fce4ec,stroke:#e91e63,stroke-width:2px +``` + +**SQL Pattern:** +```sql +SELECT sample.*, curation.*, curator.* +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'curation' +JOIN pqg AS curation ON curation.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = curation.row_id AND edge2.p = 'responsibility' +JOIN pqg AS curator ON curator.row_id = ANY(edge2.o) +``` + +--- + +## Edge Type Heatmap + +This matrix shows the "connectivity density" between entity types in the OpenContext dataset. + +### Actual Edge Counts (OpenContext Dataset - 11.6M total records) + +| **From/To** | **Material
Sample
Record** | **Sampling
Event** | **Sampling
Site** | **Geospatial
Coord
Location** | **Identified
Concept** | **Agent** | **Material
Sample
Curation** | **Sample
Relation** | +|-------------|:----------------------------------:|:----------------------:|:---------------------:|:--------------------------------------:|:--------------------------:|:---------:|:------------------------------------:|:-----------------------:| +| **MaterialSampleRecord** | - | 🔥🔥🔥
1.1M | - | - | 🔥🔥🔥🔥🔥
9.4M | ❄️
~1K | ❄️
~1K | ❄️
~1K | +| **SamplingEvent** | - | - | 🔥🔥
384K | 🔥🔥🔥
1.1M | 🔥🔥🔥
1.1M | 🔥
73K | - | - | +| **SamplingSite** | - | - | - | 🔥🔥
384K | - | - | - | - | +| **MaterialSampleCuration** | - | - | - | - | - | ❄️
~1K | - | - | + +**Legend:** +- 🔥🔥🔥🔥🔥 = >5M edges (ultra-dense) +- 🔥🔥🔥 = 1M-5M edges (very dense) +- 🔥🔥 = 100K-1M edges (dense) +- 🔥 = 10K-100K edges (moderate) +- ❄️ = <10K edges (sparse) +- `-` = 0 edges (no relationship) + +**Key insights:** +1. **MaterialSampleRecord → IdentifiedConcept** is the densest relationship (9.4M edges) + - Includes: material categories, context categories, object types, keywords +2. **MaterialSampleRecord → SamplingEvent** is critical infrastructure (1.1M edges) + - Required relationship - every sample has exactly one event +3. **Event → Coordinates** enables geospatial queries (1.1M edges) +4. **Curation and Relation** are rarely used in OpenContext data + - More common in geology (SESAR) and biology (GEOME) domains + +--- + +## Storage Structure Diagram + +This diagram shows how entities and edges are stored in the unified PQG table. + +```mermaid +graph TB + subgraph "PQG Table (Unified Storage)" + subgraph "Entity Rows (otype != '_edge_')" + E1["row_id: 1
pid: 'iSamples:...'
otype: 'MaterialSampleRecord'
label: 'Sample 42'
description: '...'"] + E2["row_id: 2
pid: 'iSamples:...'
otype: 'SamplingEvent'
label: 'Excavation 2023'
event_date: '2023-06-15'"] + E3["row_id: 3
pid: 'iSamples:...'
otype: 'GeospatialCoordLocation'
latitude: 37.5
longitude: 32.8"] + end + + subgraph "Edge Rows (otype = '_edge_')" + Edge1["row_id: 100
otype: '_edge_'
s: 1
p: 'produced_by'
o: [2]"] + Edge2["row_id: 101
otype: '_edge_'
s: 2
p: 'sample_location'
o: [3]"] + end + end + + E1 -.->|"s=1"| Edge1 + Edge1 -.->|"o=[2]"| E2 + E2 -.->|"s=2"| Edge2 + Edge2 -.->|"o=[3]"| E3 + + style E1 fill:#e1f5ff,stroke:#0077be,stroke-width:2px + style E2 fill:#fff4e1,stroke:#ff8c00,stroke-width:2px + style E3 fill:#e8f5e9,stroke:#4caf50,stroke-width:2px + style Edge1 fill:#ffebee,stroke:#c62828,stroke-width:2px + style Edge2 fill:#ffebee,stroke:#c62828,stroke-width:2px +``` + +**How it works:** +1. **Entity rows** have `otype` set to their entity type (e.g., `MaterialSampleRecord`) +2. **Edge rows** have `otype = '_edge_'` +3. **Edge `s` field** points to subject entity's `row_id` +4. **Edge `p` field** contains the predicate name (e.g., `produced_by`) +5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued) +6. **Joining** requires matching `edge.s = subject.row_id` and `object.row_id = ANY(edge.o)` + +--- + +## Predicate Usage Patterns + +This chart shows how often each predicate appears in the OpenContext dataset. + +```mermaid +%%{init: {'theme':'base'}}%% +graph LR + subgraph "Most Common (>1M edges each)" + P1["has_sample_object_type
1,124,480 edges"] + P2["produced_by
1,096,352 edges"] + P3["has_material_category
1,095,920 edges"] + P4["has_context_category
1,095,912 edges"] + P5["keywords
1,070,912 edges"] + P6["sample_location
1,095,912 edges"] + end + + subgraph "Common (100K-1M edges)" + P7["sampling_site
383,912 edges"] + P8["site_location
383,912 edges"] + end + + subgraph "Moderate (10K-100K edges)" + P9["responsibility (Event)
72,520 edges"] + end + + subgraph "Rare (<10K edges)" + P10["registrant
~1,000 edges"] + P11["curation
~500 edges"] + P12["responsibility (Curation)
~500 edges"] + end + + subgraph "Not Used in OpenContext" + P13["related_resource
0 edges"] + end + + style P1 fill:#c62828,color:#fff + style P2 fill:#c62828,color:#fff + style P3 fill:#c62828,color:#fff + style P4 fill:#c62828,color:#fff + style P5 fill:#c62828,color:#fff + style P6 fill:#f57c00,color:#fff + style P7 fill:#f57c00,color:#fff + style P8 fill:#f57c00,color:#fff + style P9 fill:#fbc02d + style P10 fill:#aed581 + style P11 fill:#aed581 + style P12 fill:#aed581 + style P13 fill:#e0e0e0 +``` + +**Domain patterns:** +- **OpenContext (archaeology):** Heavy use of categorization (material, context, object type) +- **SESAR (geology):** More use of `curation` and `registrant` (institutional tracking) +- **GEOME (biology):** Heavy use of `related_resource` (parent-child sample chains) + +--- + +## Multi-Hop Traversal Map + +This diagram shows common multi-hop query patterns and their path lengths. + +```mermaid +graph TB + MSR[MaterialSampleRecord
'Start Here'] + + MSR -->|1 hop| Event[SamplingEvent] + MSR -->|1 hop| Material[Material Category] + MSR -->|1 hop| Context[Context Category] + MSR -->|1 hop| Registrant[Registrant] + + Event -->|+1 = 2 hops| Coords1[Event Coordinates] + Event -->|+1 = 2 hops| Site[Sampling Site] + Event -->|+1 = 2 hops| Collector[Collector] + Event -->|+1 = 2 hops| EventContext[Event Context] + + Site -->|+1 = 3 hops| Coords2[Site Coordinates] + + MSR -->|1 hop| Curation[Curation Info] + Curation -->|+1 = 2 hops| Curator[Curator] + + MSR -->|1 hop| Related[Related Samples] + + classDef hop1 fill:#e1f5ff,stroke:#0077be,stroke-width:2px + classDef hop2 fill:#fff4e1,stroke:#ff8c00,stroke-width:2px + classDef hop3 fill:#e8f5e9,stroke:#4caf50,stroke-width:2px + + class MSR,Material,Context,Registrant,Event,Curation,Related hop1 + class Coords1,Site,Collector,EventContext,Curator hop2 + class Coords2 hop3 +``` + +**Path complexity:** +- **1-hop queries:** Direct attributes (material, context, keywords, registrant) +- **2-hop queries:** Location, collector, site name (most common complex queries) +- **3-hop queries:** Site coordinates (rare - usually use event coordinates instead) + +--- + +## Entity Type Connectivity + +This diagram shows how "connected" each entity type is (number of relationship types it participates in). + +```mermaid +graph LR + subgraph "Highly Connected (Hub Nodes)" + MSR["MaterialSampleRecord
8 outgoing edge types
📊 Centrality: HIGH"] + Event["SamplingEvent
4 outgoing edge types
📊 Centrality: HIGH"] + end + + subgraph "Moderately Connected" + Site["SamplingSite
1 outgoing edge type
📊 Centrality: MEDIUM"] + Curation["MaterialSampleCuration
1 outgoing edge type
📊 Centrality: MEDIUM"] + end + + subgraph "Leaf Nodes (No Outgoing Edges)" + Concept["IdentifiedConcept
0 outgoing
5 incoming edge types
📊 Centrality: HIGH (target)"] + Agent["Agent
0 outgoing
3 incoming edge types
📊 Centrality: MEDIUM (target)"] + Coords["GeospatialCoordLocation
0 outgoing
2 incoming edge types
📊 Centrality: MEDIUM (target)"] + Relation["SampleRelation
0 outgoing
1 incoming edge type
📊 Centrality: LOW (target)"] + end + + style MSR fill:#c62828,color:#fff + style Event fill:#f57c00,color:#fff + style Site fill:#fbc02d + style Curation fill:#fbc02d + style Concept fill:#9c27b0,color:#fff + style Agent fill:#7b1fa2,color:#fff + style Coords fill:#7b1fa2,color:#fff + style Relation fill:#aed581 +``` + +**Key observations:** +1. **MaterialSampleRecord** is the primary hub (8 outgoing relationship types) +2. **SamplingEvent** is the secondary hub (4 outgoing relationship types) +3. **IdentifiedConcept** is the most popular target (5 different incoming predicates) +4. **Agent, Coords** are intermediate targets (2-3 incoming predicates each) +5. **SampleRelation** is rarely used (1 incoming predicate, sparse in data) + +--- + +## The 14 Sentence Types (Grammar Summary) + +Visual summary of the complete iSamples "grammar": + +```mermaid +graph TB + subgraph "1. MaterialSampleRecord Sentences (8 types)" + S1["Sample --produced_by→ Event
REQUIRED"] + S2["Sample --has_material_category→ Concept
Material type"] + S3["Sample --has_context_category→ Concept
Sampled feature"] + S4["Sample --has_sample_object_type→ Concept
Object classification"] + S5["Sample --keywords→ Concept
Discovery terms"] + S6["Sample --registrant→ Agent
Who registered"] + S7["Sample --curation→ Curation
Archive info"] + S8["Sample --related_resource→ Relation
Sample links"] + end + + subgraph "2. SamplingEvent Sentences (4 types)" + E1["Event --sampling_site→ Site
Named location"] + E2["Event --sample_location→ Coords
Exact coordinates"] + E3["Event --has_context_category→ Concept
Event type"] + E4["Event --responsibility→ Agent
Collector"] + end + + subgraph "3. SamplingSite Sentences (1 type)" + T1["Site --site_location→ Coords
Site centroid"] + end + + subgraph "4. MaterialSampleCuration Sentences (1 type)" + C1["Curation --responsibility→ Agent
Curator"] + end + + style S1 fill:#c62828,color:#fff + style S2 fill:#e57373,color:#fff + style S3 fill:#e57373,color:#fff + style S4 fill:#e57373,color:#fff + style S5 fill:#e57373,color:#fff + style S6 fill:#e57373,color:#fff + style S7 fill:#e57373,color:#fff + style S8 fill:#e57373,color:#fff + style E1 fill:#f57c00,color:#fff + style E2 fill:#f57c00,color:#fff + style E3 fill:#f57c00,color:#fff + style E4 fill:#f57c00,color:#fff + style T1 fill:#4caf50,color:#fff + style C1 fill:#9c27b0,color:#fff +``` + +**Total:** 14 edge types = Complete grammar of iSamples property graphs + +--- + +## Cross-Domain Comparison + +How different scientific domains use the 14 edge types: + +```mermaid +%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#e1f5ff'}}}%% +graph TB + subgraph "All Domains Use (Core Infrastructure)" + Core["produced_by
has_material_category
has_context_category
sample_location"] + end + + subgraph "Archaeology Heavy Use (OpenContext)" + Arch["keywords
has_sample_object_type
sampling_site
site_location"] + end + + subgraph "Geology Heavy Use (SESAR)" + Geo["registrant
curation
responsibility (Curation)"] + end + + subgraph "Biology Heavy Use (GEOME)" + Bio["related_resource
responsibility (Event)"] + end + + style Core fill:#4caf50,color:#fff + style Arch fill:#ff8c00,color:#fff + style Geo fill:#2196f3,color:#fff + style Bio fill:#9c27b0,color:#fff +``` + +**Why different patterns?** +- **Archaeology:** Heavy emphasis on discovery/publication (keywords, object types) +- **Geology:** Institutional tracking (registrants, repositories, curators) +- **Biology:** Sample lineage (parent-child relationships via related_resource) +- **All domains:** Need material classification and geographic coordinates + +--- + +## Graph Query Complexity Chart + +This chart shows the complexity distribution of common queries: + +| **Query Type** | **Hops** | **Joins** | **Complexity** | **Example** | +|----------------|:--------:|:---------:|:--------------:|-------------| +| Get sample label | 0 | 0 | ⭐ | `SELECT label FROM pqg WHERE pid=?` | +| Get material category | 1 | 2 | ⭐⭐ | Sample → Category | +| Get sample coordinates | 2 | 4 | ⭐⭐⭐ | Sample → Event → Coords | +| Get collector name | 2 | 4 | ⭐⭐⭐ | Sample → Event → Agent | +| Get site name | 2 | 4 | ⭐⭐⭐ | Sample → Event → Site | +| Get site coordinates | 3 | 6 | ⭐⭐⭐⭐ | Sample → Event → Site → Coords | +| Get all related samples | 2-4 | 4-8 | ⭐⭐⭐⭐⭐ | Sample → Relation → Samples (recursive) | + +**Performance tip:** Cache 2-hop queries (coordinates, collectors) - they're the most common complex pattern. 
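
As a rough illustration of the caching tip, the sketch below memoizes the sample-to-coordinates traversal (`produced_by`, then `sample_location`) in plain Python. It is not the project's implementation: the in-memory `ROWS` list is a hypothetical stand-in for the PQG table, filled with the same rows shown in the Storage Structure Diagram, and the names `follow` and `sample_coordinates` are invented here for the example.

```python
from functools import lru_cache

# Hypothetical stand-in for the PQG table: entity rows and edge rows share
# one list, and edge rows are marked with otype == "_edge_".
ROWS = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "label": "Sample 42"},
    {"row_id": 2, "otype": "SamplingEvent", "label": "Excavation 2023"},
    {"row_id": 3, "otype": "GeospatialCoordLocation",
     "latitude": 37.5, "longitude": 32.8},
    {"row_id": 100, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 101, "otype": "_edge_", "s": 2, "p": "sample_location", "o": [3]},
]

# Entity lookup by row_id.
ENTITIES = {row["row_id"]: row for row in ROWS if row["otype"] != "_edge_"}

# Edge index: (subject row_id, predicate) -> list of object row_ids.
EDGES = {}
for row in ROWS:
    if row["otype"] == "_edge_":
        EDGES.setdefault((row["s"], row["p"]), []).extend(row["o"])


def follow(row_id, predicate):
    """Return the object row_ids reached from row_id via predicate (1 hop)."""
    return tuple(EDGES.get((row_id, predicate), ()))


@lru_cache(maxsize=None)
def sample_coordinates(sample_row_id):
    """2-hop traversal, memoized: sample -> event -> coordinates."""
    coords = []
    for event_id in follow(sample_row_id, "produced_by"):
        for coord_id in follow(event_id, "sample_location"):
            entity = ENTITIES[coord_id]
            coords.append((entity["latitude"], entity["longitude"]))
    return tuple(coords)
```

Calling `sample_coordinates(1)` a second time is served from the cache instead of walking the edges again, and `sample_coordinates.cache_info()` reports the hit. A real deployment would also need to invalidate cached results when rows change, which this sketch leaves out.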
+ +--- + +## Next Steps + +- **SQL examples**: See [QUERYING_THE_GRAPH.md](QUERYING_THE_GRAPH.md) for detailed SQL patterns +- **Predicate details**: See [PREDICATES_REFERENCE.md](PREDICATES_REFERENCE.md) for each relationship type +- **Conceptual guide**: See [UNDERSTANDING_THE_GRAPH.md](UNDERSTANDING_THE_GRAPH.md) for foundations +- **Real examples**: See [EXAMPLES_BY_DOMAIN.md](EXAMPLES_BY_DOMAIN.md) for complete YAML samples + +--- + +**Last updated:** 2025-11-14 +**Part of:** iSamples Property Graph Documentation Suite diff --git a/src/docs/EXAMPLES_BY_DOMAIN.md b/src/docs/EXAMPLES_BY_DOMAIN.md new file mode 100644 index 00000000..8cec8ea9 --- /dev/null +++ b/src/docs/EXAMPLES_BY_DOMAIN.md @@ -0,0 +1,842 @@ +# iSamples Examples by Scientific Domain + +**Purpose:** Demonstrate how the same iSamples schema works across different scientific domains with concrete real-world examples. + +**Key Insight:** The iSamples model is truly **domain-agnostic** - the same 8 entity types and 14 predicates work for archaeology, geology, biology, and more. **Only the values change**, not the structure. + +--- + +## Table of Contents + +1. [Archaeology (OpenContext)](#archaeology-opencontext) +2. [Geology (SESAR - Projected)](#geology-sesar---projected) +3. [Biology (GEOME - Projected)](#biology-geome---projected) +4. [Cross-Domain Comparison](#cross-domain-comparison) +5. 
[Domain-Specific Patterns](#domain-specific-patterns) + +--- + +## Archaeology (OpenContext) + +**Data Source:** OpenContext (https://opencontext.org) +**Dataset Size:** 1,096,352 samples from archaeological excavations worldwide +**Primary Domain:** Cultural heritage, archaeological artifacts + +### Sample Profile: Pottery Sherd from Çatalhöyük + +#### Complete Graph Structure + +``` +MaterialSampleRecord (Pottery Sherd) + ├─ produced_by ───────→ SamplingEvent (2023 Excavation) + │ ├─ sampling_site ───→ SamplingSite (Çatalhöyük South Area) + │ │ └─ site_location ───→ GeospatialCoordLocation (37.666°N, 32.827°E) + │ ├─ sample_location ──→ GeospatialCoordLocation (37.6665°N, 32.8274°E, depth: 3.2m) + │ └─ responsibility ───→ Agent (Dr. Sarah Johnson) + │ + ├─ has_material_category ─→ IdentifiedConcept (Earthenware) + ├─ has_context_category ──→ IdentifiedConcept (Terrestrial > Archaeological) + ├─ has_sample_object_type ─→ IdentifiedConcept (Sherd) + ├─ keywords ──────────────→ IdentifiedConcept (Neolithic) + ├─ keywords ──────────────→ IdentifiedConcept (Pottery) + ├─ keywords ──────────────→ IdentifiedConcept (Red-slipped ware) + └─ registrant ────────────→ Agent (OpenContext Data Curator) +``` + +#### Full YAML Example + +```yaml +# === SAMPLE NODE === +sample_pottery_001: + otype: MaterialSampleRecord + pid: "igsn:IEOCH0001" + label: "Ceramic bowl rim fragment, Trench 5, Level 3" + description: > + Red-slipped pottery sherd with geometric incised decoration. + Bowl rim fragment with 15cm estimated diameter. + Fine-grained clay matrix with minimal tempering. + sample_identifier: "CATAL-2023-T5-L3-P001" + +# === SAMPLING EVENT NODE === +event_excavation_001: + otype: SamplingEvent + pid: "event:catal-2023-t5-l3" + label: "Çatalhöyük 2023, Trench 5, Level 3" + description: > + Systematic excavation of Neolithic domestic structure. + Level 3 represents occupation phase dated 6500-6400 BCE. 
+ Standard archaeological excavation methodology with 3D recording. + result_time: "2023-07-15T14:30:00Z" + has_feature_of_interest: "Neolithic architectural feature: building floor" + project: "Çatalhöyük Research Project" + +# === SAMPLING SITE NODE === +site_catalhoyuk: + otype: SamplingSite + pid: "site:catalhoyuk-south-area" + label: "Çatalhöyük South Area" + description: > + Neolithic settlement mound in central Anatolia, Turkey. + UNESCO World Heritage Site. Occupied 7100-5950 BCE. + place_name: + - "Çatalhöyük" + - "Çatal Höyük" + - "Chatal Huyuk" + +# === GEOSPATIAL COORDINATE NODES === +coords_site: + otype: GeospatialCoordLocation + pid: "coords:catalhoyuk-site-center" + latitude: 37.666 + longitude: 32.827 + elevation: "1000 m above mean sea level" + obfuscated: false + +coords_sample: + otype: GeospatialCoordLocation + pid: "coords:catal-2023-t5-l3-p001" + latitude: 37.6665 + longitude: 32.8274 + elevation: "3.2 m below surface" + obfuscated: false + +# === IDENTIFIED CONCEPT NODES === +concept_earthenware: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/material/0.9/earthenware" + label: "Earthenware" + scheme_name: "iSamples Material Type Vocabulary" + scheme_uri: "https://w3id.org/isample/vocabulary/material/" + +concept_archaeological: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/sampledfeature/0.9/terrestrial_archaeological" + label: "Terrestrial environment > Archaeological site" + scheme_name: "iSamples Sampled Feature Vocabulary" + +concept_sherd: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/materialsampleobjecttype/0.9/sherd" + label: "Sherd" + scheme_name: "iSamples Material Sample Object Type Vocabulary" + +keyword_neolithic: + otype: IdentifiedConcept + pid: "keyword:neolithic" + label: "Neolithic" + +keyword_pottery: + otype: IdentifiedConcept + pid: "keyword:pottery" + label: "Pottery" + +keyword_redslipped: + otype: IdentifiedConcept + pid: 
"keyword:red-slipped-ware" + label: "Red-slipped ware" + +# === AGENT NODES === +agent_collector: + otype: Agent + pid: "https://orcid.org/0000-0002-1234-5678" + name: "Dr. Sarah Johnson" + affiliation: "University of Cambridge, McDonald Institute" + contact_information: "sjohnson@cam.ac.uk" + role: "Field Supervisor" + +agent_registrant: + otype: Agent + pid: "agent:opencontext-curator" + name: "OpenContext Data Team" + affiliation: "The Alexandria Archive Institute" + contact_information: "info@opencontext.org" + role: "Data Curator" + +# === EDGES === +# Sample → Event +edge_produced_by: + otype: _edge_ + s: sample_pottery_001 + p: produced_by + o: [event_excavation_001] + +# Event → Site +edge_sampling_site: + otype: _edge_ + s: event_excavation_001 + p: sampling_site + o: [site_catalhoyuk] + +# Site → Site Coordinates +edge_site_location: + otype: _edge_ + s: site_catalhoyuk + p: site_location + o: [coords_site] + +# Event → Sample Coordinates +edge_sample_location: + otype: _edge_ + s: event_excavation_001 + p: sample_location + o: [coords_sample] + +# Event → Collector +edge_responsibility: + otype: _edge_ + s: event_excavation_001 + p: responsibility + o: [agent_collector] + +# Sample → Material Type +edge_material: + otype: _edge_ + s: sample_pottery_001 + p: has_material_category + o: [concept_earthenware] + +# Sample → Context +edge_context: + otype: _edge_ + s: sample_pottery_001 + p: has_context_category + o: [concept_archaeological] + +# Sample → Object Type +edge_object_type: + otype: _edge_ + s: sample_pottery_001 + p: has_sample_object_type + o: [concept_sherd] + +# Sample → Keywords (multivalued) +edge_keyword_1: + otype: _edge_ + s: sample_pottery_001 + p: keywords + o: [keyword_neolithic] + +edge_keyword_2: + otype: _edge_ + s: sample_pottery_001 + p: keywords + o: [keyword_pottery] + +edge_keyword_3: + otype: _edge_ + s: sample_pottery_001 + p: keywords + o: [keyword_redslipped] + +# Sample → Registrant +edge_registrant: + 
otype: _edge_ + s: sample_pottery_001 + p: registrant + o: [agent_registrant] +``` + +### Archaeology-Specific Patterns + +**What's unique:** +- Heavy use of **keywords** for taxonomic and cultural terms +- **Detailed site names** (place_name with multiple spellings) +- **Depth measurements** instead of elevation ("3.2 m below surface") +- **Cultural periods** in keywords (Neolithic, Bronze Age, etc.) +- **No curation information** (samples often remain at excavation sites) + +**Edge types used:** 10 of 14 +- ✅ produced_by, has_material_category, has_context_category, has_sample_object_type +- ✅ keywords, registrant, sampling_site, sample_location, responsibility (Event), site_location +- ❌ curation, related_resource, has_context_category (Event), responsibility (Curation) + +--- + +## Geology (SESAR - Projected) + +**Data Source:** SESAR (System for Earth Sample Registration) +**Dataset Size:** ~1M+ rock, mineral, and sediment samples +**Primary Domain:** Earth sciences, petrology, geochemistry + +### Sample Profile: Basalt Core from Mid-Ocean Ridge + +#### Complete Graph Structure + +``` +MaterialSampleRecord (Basalt Core) + ├─ produced_by ───────→ SamplingEvent (2023 Drilling) + │ ├─ sample_location ──→ GeospatialCoordLocation (45.5°N, 130.2°W, -2500m depth) + │ └─ responsibility ───→ Agent (Dr.
Maria Rodriguez) + │ + ├─ has_material_category ─→ IdentifiedConcept (Basalt) + ├─ has_context_category ──→ IdentifiedConcept (Marine > Submerged terrestrial) + ├─ has_sample_object_type ─→ IdentifiedConcept (Core) + ├─ keywords ──────────────→ IdentifiedConcept (MORB - Mid-Ocean Ridge Basalt) + ├─ curation ──────────────→ MaterialSampleCuration (Lamont Core Repository) + │ └─ responsibility ───→ Agent (Core Facility Manager) + └─ registrant ────────────→ Agent (SESAR Data Manager) +``` + +#### Full YAML Example + +```yaml +# === SAMPLE NODE === +sample_basalt_core: + otype: MaterialSampleRecord + pid: "igsn:IESEA0001" + label: "Basalt core from Juan de Fuca Ridge" + description: > + Fresh basalt core, 6cm diameter, 15cm length. + Holocrystalline texture with plagioclase and pyroxene phenocrysts. + Collected from pillow basalt at mid-ocean ridge spreading center. + sample_identifier: "JDFR-2023-DR-001-C1" + +# === SAMPLING EVENT NODE === +event_drilling: + otype: SamplingEvent + pid: "event:jdfr-2023-dredge-001" + label: "Juan de Fuca Ridge Dredge 001, 2023" + description: > + Rock dredge operation from R/V Thompson. + Dredge deployed at 2500m depth on ridge axis. + Standard petrological sampling protocol. 
+ result_time: "2023-08-22T10:45:00Z" + has_feature_of_interest: "Mid-ocean ridge basalt outcrop" + project: "NSF OCE-2023456: Juan de Fuca Ridge Magmatic Evolution" + +# === GEOSPATIAL COORDINATE NODE === +coords_sample: + otype: GeospatialCoordLocation + pid: "coords:jdfr-2023-dr-001" + latitude: 45.5 + longitude: -130.2 + elevation: "-2500 m below sea level" + obfuscated: false + +# === IDENTIFIED CONCEPT NODES === +concept_basalt: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/material/0.9/basalt" + label: "Basalt" + scheme_name: "iSamples Material Type Vocabulary" + +concept_marine: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/sampledfeature/0.9/marinewaterbody" + label: "Marine water body" + scheme_name: "iSamples Sampled Feature Vocabulary" + +concept_core: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/materialsampleobjecttype/0.9/core" + label: "Core" + scheme_name: "iSamples Material Sample Object Type Vocabulary" + +keyword_morb: + otype: IdentifiedConcept + pid: "keyword:morb" + label: "MORB" + description: "Mid-Ocean Ridge Basalt" + +# === AGENT NODES === +agent_collector: + otype: Agent + pid: "https://orcid.org/0000-0003-5678-9012" + name: "Dr. Maria Rodriguez" + affiliation: "Scripps Institution of Oceanography" + role: "Chief Scientist" + +agent_curator: + otype: Agent + pid: "agent:lamont-core-manager" + name: "James Chen" + affiliation: "Lamont-Doherty Core Repository" + role: "Core Facility Manager" + +agent_registrant: + otype: Agent + pid: "agent:sesar-manager" + name: "SESAR Data Management Team" + affiliation: "Lamont-Doherty Earth Observatory" + role: "Sample Registry Manager" + +# === CURATION NODE === +curation_lamont: + otype: MaterialSampleCuration + pid: "curation:lamont-core-repo" + label: "Lamont-Doherty Core Repository" + description: > + World-class marine core repository. + Temperature-controlled storage, 4°C. + Catalog available online. 
+ curation_location: "Lamont-Doherty Earth Observatory, Palisades, NY" + access_constraints: "Request access via SESAR portal. Sampling approval required." + +# === EDGES === +edge_produced_by: + s: sample_basalt_core + p: produced_by + o: [event_drilling] + +edge_sample_location: + s: event_drilling + p: sample_location + o: [coords_sample] + +edge_responsibility_event: + s: event_drilling + p: responsibility + o: [agent_collector] + +edge_material: + s: sample_basalt_core + p: has_material_category + o: [concept_basalt] + +edge_context: + s: sample_basalt_core + p: has_context_category + o: [concept_marine] + +edge_object_type: + s: sample_basalt_core + p: has_sample_object_type + o: [concept_core] + +edge_keyword: + s: sample_basalt_core + p: keywords + o: [keyword_morb] + +edge_curation: + s: sample_basalt_core + p: curation + o: [curation_lamont] + +edge_curation_responsibility: + s: curation_lamont + p: responsibility + o: [agent_curator] + +edge_registrant: + s: sample_basalt_core + p: registrant + o: [agent_registrant] +``` + +### Geology-Specific Patterns + +**What's unique:** +- Heavy use of **curation** (samples stored in repositories) +- **Negative elevations** for marine samples ("-2500 m below sea level") +- **Formal project identifiers** (NSF grant numbers) +- **Repository access constraints** (destructive sampling approval) +- **Less use of keywords** (more reliance on formal material classification) + +**Edge types used (projected):** 10 of 14 +- ✅ produced_by, has_material_category, has_context_category, has_sample_object_type +- ✅ keywords, registrant, curation, responsibility (Event), responsibility (Curation), sample_location +- ❌ related_resource, sampling_site, site_location, has_context_category (Event) + +--- + +## Biology (GEOME - Projected) + +**Data Source:** GEOME (Genomic Observatories Metadatabase) +**Dataset Size:** ~100K+ tissue and DNA samples from marine organisms +**Primary Domain:** Marine biology, genomics, biodiversity + 
+### Sample Profile: Coral Tissue Sample from Pacific Reef + +#### Complete Graph Structure + +``` +MaterialSampleRecord (Tissue Sample) + ├─ produced_by ───────→ SamplingEvent (2024 Field Collection) + │ ├─ sampling_site ───→ SamplingSite (Palmyra Atoll Reef) + │ │ └─ site_location ───→ GeospatialCoordLocation (5.87°N, 162.08°W) + │ ├─ sample_location ──→ GeospatialCoordLocation (5.8715°N, 162.0823°W) + │ └─ responsibility ───→ Agent (Dr. Carlos Alvarez) + │ + ├─ has_material_category ─→ IdentifiedConcept (Organic material > Tissue) + ├─ has_context_category ──→ IdentifiedConcept (Marine > Marine biome) + ├─ has_sample_object_type ─→ IdentifiedConcept (Specimen) + ├─ keywords ──────────────→ IdentifiedConcept (Pocillopora damicornis) + ├─ keywords ──────────────→ IdentifiedConcept (Coral) + ├─ keywords ──────────────→ IdentifiedConcept (Scleractinia) + ├─ related_resource ──────→ SampleRelation (Derived DNA extract) + └─ registrant ────────────→ Agent (GEOME Data Manager) + +# DNA extract linked via SampleRelation +MaterialSampleRecord (DNA Extract) + └─ related_resource ──────→ SampleRelation (Derived from tissue) +``` + +#### Full YAML Example + +```yaml +# === PARENT SAMPLE (TISSUE) === +sample_tissue: + otype: MaterialSampleRecord + pid: "igsn:IEGEN0001" + label: "Pocillopora damicornis tissue, Palmyra Atoll" + description: > + Tissue sample from branching coral colony. + Approximately 1cm³ tissue preserved in 95% ethanol. + Colony health: excellent. No visible bleaching. + sample_identifier: "PALM-2024-CORAL-001-T" + +# === CHILD SAMPLE (DNA EXTRACT) === +sample_dna: + otype: MaterialSampleRecord + pid: "igsn:IEGEN0002" + label: "DNA extract from Pocillopora damicornis tissue PALM-2024-CORAL-001-T" + description: > + High molecular weight DNA extracted using Qiagen DNeasy kit. + Concentration: 45 ng/µL. 260/280 ratio: 1.82. 
+ sample_identifier: "PALM-2024-CORAL-001-DNA" + +# === SAMPLING EVENT === +event_collection: + otype: SamplingEvent + pid: "event:palmyra-2024-dive-005" + label: "Palmyra Atoll 2024, Dive 005" + description: > + SCUBA collection at 12m depth. + Reef flat dominated by Pocillopora and Porites. + Minimal impact sampling protocol (1cm² fragments). + result_time: "2024-06-15T11:20:00Z" + has_feature_of_interest: "Coral reef ecosystem" + project: "NSF OCE-2024123: Pacific Coral Genomics" + +# === SAMPLING SITE === +site_palmyra: + otype: SamplingSite + pid: "site:palmyra-atoll-reef" + label: "Palmyra Atoll, Fore Reef Site A" + description: > + Pristine coral reef system. U.S. National Wildlife Refuge. + High coral cover (>50%). Minimal anthropogenic impact. + place_name: + - "Palmyra Atoll" + - "Palmyra Island" + +# === GEOSPATIAL COORDINATES === +coords_site: + otype: GeospatialCoordLocation + pid: "coords:palmyra-site-a" + latitude: 5.87 + longitude: -162.08 + elevation: "-12 m below sea level (dive depth)" + obfuscated: false + +coords_sample: + otype: GeospatialCoordLocation + pid: "coords:palmyra-dive-005-001" + latitude: 5.8715 + longitude: -162.0823 + elevation: "-12 m below sea level" + obfuscated: false + +# === IDENTIFIED CONCEPTS === +concept_tissue: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/material/0.9/organicmaterial" + label: "Organic material > Tissue" + scheme_name: "iSamples Material Type Vocabulary" + +concept_marine_biome: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/sampledfeature/0.9/marinebiome" + label: "Marine biome" + scheme_name: "iSamples Sampled Feature Vocabulary" + +concept_specimen: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/materialsampleobjecttype/0.9/specimen" + label: "Specimen" + scheme_name: "iSamples Material Sample Object Type Vocabulary" + +keyword_species: + otype: IdentifiedConcept + pid: "taxon:pocillopora-damicornis" + label: "Pocillopora 
damicornis" + description: "Cauliflower coral" + +keyword_coral: + otype: IdentifiedConcept + pid: "keyword:coral" + label: "Coral" + +keyword_scleractinia: + otype: IdentifiedConcept + pid: "taxon:scleractinia" + label: "Scleractinia" + description: "Stony corals" + +# === AGENTS === +agent_collector: + otype: Agent + pid: "https://orcid.org/0000-0001-9876-5432" + name: "Dr. Carlos Alvarez" + affiliation: "University of Hawaiʻi, Hawaiʻi Institute of Marine Biology" + role: "Principal Investigator" + +agent_registrant: + otype: Agent + pid: "agent:geome-manager" + name: "GEOME Data Team" + affiliation: "Smithsonian Institution" + role: "Genomic Data Manager" + +# === SAMPLE RELATION (PARENT-CHILD) === +relation_dna_extract: + otype: SampleRelation + pid: "relation:tissue-to-dna-001" + label: "DNA extracted from tissue" + description: "High molecular weight DNA extraction for whole genome sequencing" + relationship: "derivedFrom" + target: "igsn:IEGEN0001" # Points to parent tissue sample + +# === EDGES === +# Tissue sample edges +edge_tissue_event: + s: sample_tissue + p: produced_by + o: [event_collection] + +edge_tissue_material: + s: sample_tissue + p: has_material_category + o: [concept_tissue] + +edge_tissue_context: + s: sample_tissue + p: has_context_category + o: [concept_marine_biome] + +edge_tissue_object: + s: sample_tissue + p: has_sample_object_type + o: [concept_specimen] + +edge_tissue_keyword1: + s: sample_tissue + p: keywords + o: [keyword_species] + +edge_tissue_keyword2: + s: sample_tissue + p: keywords + o: [keyword_coral] + +edge_tissue_keyword3: + s: sample_tissue + p: keywords + o: [keyword_scleractinia] + +edge_tissue_registrant: + s: sample_tissue + p: registrant + o: [agent_registrant] + +# DNA sample → parent tissue relationship +edge_dna_relation: + s: sample_dna + p: related_resource + o: [relation_dna_extract] + +# Event edges +edge_event_site: + s: event_collection + p: sampling_site + o: [site_palmyra] + +edge_event_location: + 
 s: event_collection + p: sample_location + o: [coords_sample] + +edge_event_responsibility: + s: event_collection + p: responsibility + o: [agent_collector] + +# Site edges +edge_site_location: + s: site_palmyra + p: site_location + o: [coords_site] +``` + +### Biology-Specific Patterns + +**What's unique:** +- Heavy use of **related_resource** (tissue → DNA → sequence data) +- **Taxonomic keywords** (species names, higher taxa) +- **Preservation methods** in descriptions ("95% ethanol") +- **Sample chains** (organism → tissue → extract → library) +- **Precise dive/collection coordinates** + +**Edge types used (projected):** 11 of 14 +- ✅ produced_by, has_material_category, has_context_category, has_sample_object_type +- ✅ keywords, registrant, related_resource, sampling_site, sample_location, responsibility (Event), site_location +- ❌ curation, has_context_category (Event), responsibility (Curation) + +--- + +## Cross-Domain Comparison + +### Entity Usage Comparison + +| Entity Type | Archaeology | Geology | Biology | +|-------------|-------------|---------|---------| +| MaterialSampleRecord | Pottery, bone, charcoal | Rocks, cores, minerals | Tissue, DNA, specimens | +| SamplingEvent | Excavation, surface collection | Drilling, dredging | SCUBA, trap, net | +| SamplingSite | Archaeological sites | Formations, localities | Reefs, stations, plots | +| GeospatialCoordLocation | Depth below surface | Depth below sea level | Depth below sea level | +| IdentifiedConcept | Cultural periods, pottery types | Rock types, minerals | Taxa, specimen types | +| Agent | Archaeologists, curators | Geologists, repository staff | Marine biologists, geneticists | +| MaterialSampleCuration | Rarely used | Core repositories | Biobanks, tissue collections | +| SampleRelation | Rare | Rare | Common (parent-child chains) | + +### Predicate Usage Comparison + +| Predicate | Archaeology | Geology | Biology | +|-----------|-------------|---------|---------| +| produced_by | 
✅ Every sample | ✅ Every sample | ✅ Every sample | +| has_material_category | Pottery, bone, stone | Basalt, granite, sediment | Tissue, DNA, whole organism | +| has_context_category | Terrestrial/Archaeological | Marine, Terrestrial, Subsurface | Marine biome, Terrestrial | +| has_sample_object_type | Sherd, artifact | Core, hand specimen | Specimen, tissue | +| keywords | Cultural terms, periods | Rock types, formation names | Taxonomic names | +| registrant | ✅ Common | ✅ Common | ✅ Common | +| curation | ❌ Rare | ✅ Very common | ⚪ Sometimes | +| related_resource | ❌ Rare | ❌ Rare | ✅ Very common | +| sampling_site | ✅ Common (site names) | ⚪ Sometimes | ✅ Common (stations, reefs) | +| sample_location | ✅ Very common | ✅ Very common | ✅ Very common | +| responsibility (Event) | ✅ Common | ✅ Common | ✅ Common | +| has_context_category (Event) | ❌ Not used | ❌ Not used | ❌ Not used | +| site_location | ✅ Common | ⚪ Sometimes | ✅ Common | +| responsibility (Curation) | ❌ Not used | ✅ Common | ⚪ Sometimes | + +### Material Type Patterns + +**Archaeology:** +- Anthropogenic material (pottery, glass, metal) +- Organic material (bone, charcoal, wood) +- Rock (stone tools, building materials) +- Soil (sediment samples) + +**Geology:** +- Rock (igneous, sedimentary, metamorphic) +- Mineral (individual mineral specimens) +- Sediment (unconsolidated material) +- Fluid (water, hydrothermal fluids) + +**Biology:** +- Organic material (tissue, DNA, whole organisms) +- Liquid water (seawater, freshwater samples) +- Biogenic non-organic material (shells, coral skeleton) + +--- + +## Domain-Specific Patterns + +### Pattern 1: Archaeological Depth Notation + +**Challenge:** Archaeologists measure depth **below surface**, not elevation above sea level. 
+**Solution:** Use elevation field with descriptive text: +```yaml +elevation: "3.2 m below surface" +elevation: "Level 5, 4.8 m below datum" +``` + +### Pattern 2: Marine Sample Depths + +**Challenge:** Marine samples need **negative elevation** (below sea level). + +**Solution:** +```yaml +elevation: "-2500 m below sea level" +elevation: "-12 m (dive depth)" +``` + +### Pattern 3: Sample Chains (Biology) + +**Challenge:** Tissue → DNA → Sequencing Library are all samples. + +**Solution:** Use `related_resource` with `derivedFrom` relationship: +```yaml +# DNA sample points to tissue parent +dna_sample: + related_resource: + - relationship: "derivedFrom" + target: "igsn:TISSUE001" + +# Sequencing library points to DNA parent +library_sample: + related_resource: + - relationship: "derivedFrom" + target: "igsn:DNA001" +``` + +### Pattern 4: Repository Storage (Geology) + +**Challenge:** Core samples stored in repositories with access constraints. + +**Solution:** Use `curation` entity: +```yaml +sample: + curation: + label: "Lamont-Doherty Core Repository" + curation_location: "Palisades, NY" + access_constraints: "Destructive sampling requires approval" + responsibility: + - name: "Core Facility Manager" +``` + +### Pattern 5: Multi-lingual Site Names (Archaeology) + +**Challenge:** Archaeological sites have multiple name spellings. + +**Solution:** Use `place_name` array: +```yaml +site: + place_name: + - "Çatalhöyük" + - "Çatal Höyük" + - "Chatal Huyuk" + - "جاتال هويوك" # Arabic +``` + +--- + +## Summary + +**Key Takeaways:** + +1. **Same schema, different values** - The iSamples model truly works across domains +2. **10-11 of 14 predicates used per domain** - Different domains use different subsets +3. **Context category distinguishes domains** - Terrestrial/Archaeological vs Marine biome vs Subsurface +4. **Material category is domain-specific** - Pottery vs Basalt vs Tissue +5. 
**Curation patterns differ** - Geology stores cores, archaeology often doesn't track storage +6. **Related_resource is biology-heavy** - Sample chains common in genomics, rare elsewhere + +**Design Wisdom:** + +✅ **Universal model:** 8 entity types work across all domains +✅ **Flexible values:** Controlled vocabularies adapt to domain needs +✅ **Optional predicates:** Each domain uses relevant subset +✅ **Extensible:** Can add domain-specific keywords without schema changes + +**Next steps:** +- [QUERYING_THE_GRAPH.md](./QUERYING_THE_GRAPH.md) - SQL patterns for cross-domain queries +- [EDGE_TYPES_VISUAL.md](./EDGE_TYPES_VISUAL.md) - Visual diagrams of patterns + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-14 +**Schema Version:** 20250207 (MaterialSampleRecord) +**Author:** Claude Code (Sonnet 4.5) diff --git a/src/docs/PREDICATES_REFERENCE.md b/src/docs/PREDICATES_REFERENCE.md new file mode 100644 index 00000000..638c516e --- /dev/null +++ b/src/docs/PREDICATES_REFERENCE.md @@ -0,0 +1,1009 @@ +# iSamples Predicates Reference + +**Purpose:** Detailed reference for each of the 14 relationship types (predicates) in the iSamples property graph. + +**Audience:** Developers querying iSamples data, data providers creating metadata, tool builders integrating with iSamples. 
+ +--- + +## Quick Reference Table + +| Predicate | Subject → Object | Cardinality | Required | Description | +|-----------|------------------|-------------|----------|-------------| +| [produced_by](#produced_by) | MaterialSampleRecord → SamplingEvent | One | ✅ Yes | Sample creation event | +| [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Material type | +| [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Domain context | +| [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Physical form | +| [keywords](#keywords) | MaterialSampleRecord → IdentifiedConcept | Many | ⚪ No | Discovery keywords | +| [registrant](#registrant) | MaterialSampleRecord → Agent | One | ⚪ No | Registering agent | +| [curation](#curation) | MaterialSampleRecord → MaterialSampleCuration | One | ⚪ No | Storage info | +| [related_resource](#related_resource) | MaterialSampleRecord → SampleRelation | Many | ⚪ No | Sample relationships | +| [sampling_site](#sampling_site) | SamplingEvent → SamplingSite | One | ⚪ No | Named location | +| [sample_location](#sample_location) | SamplingEvent → GeospatialCoordLocation | One | ⚪ No | Precise coords | +| [responsibility](#responsibility-samplingevent) | SamplingEvent → Agent | Many | ⚪ No | Collectors | +| [has_context_category](#has_context_category-samplingevent) | SamplingEvent → IdentifiedConcept | Many | ⚪ No | Event context | +| [site_location](#site_location) | SamplingSite → GeospatialCoordLocation | One | ⚪ No | Site coords | +| [responsibility](#responsibility-materialsamplecuration) | MaterialSampleCuration → Agent | Many | ⚪ No | Curators | + +--- + +## MaterialSampleRecord Predicates + +### produced_by + +**Type:** MaterialSampleRecord → SamplingEvent +**Cardinality:** One (required) +**Required:** ✅ Yes 
+ +#### Purpose +Links a material sample to the event that created/collected it. This is the **most important relationship** in iSamples - every sample must have provenance. + +#### Controlled Vocabulary +Not applicable - targets a SamplingEvent node. + +#### Usage Example + +**YAML:** +```yaml +# Sample node +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + label: "Pottery sherd from Trench 5" + +# Event node +event_001: + otype: SamplingEvent + pid: "event:2023-catal-t5-001" + label: "2023 Excavation, Trench 5, Level 3" + result_time: "2023-07-15" + +# Edge +edge_001: + otype: _edge_ + s: sample_001 # Subject: the sample + p: produced_by # Predicate + o: [event_001] # Object: the event +``` + +#### SQL Query Pattern + +**Find all samples and their collection dates:** +```sql +SELECT + sample.pid AS sample_id, + sample.label AS sample_label, + event.pid AS event_id, + event.label AS event_label, + event.result_time AS collection_date +FROM pqg AS sample +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'produced_by' + AND edge.otype = '_edge_' +JOIN pqg AS event + ON event.row_id = ANY(edge.o) +WHERE sample.otype = 'MaterialSampleRecord'; +``` + +**Find samples collected in a specific time range:** +```sql +SELECT + sample.pid, + sample.label, + event.result_time +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND event.result_time BETWEEN '2023-01-01' AND '2023-12-31'; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,352 relationships (one per sample) +- **Unique subjects:** 1,096,352 samples +- **Unique objects:** 1,096,352 events (1:1 ratio) + +#### Common Issues + +❌ **Missing produced_by:** +``` +Error: MaterialSampleRecord must have produced_by relationship +``` +**Solution:** Every sample requires a SamplingEvent. 
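The "exactly one `produced_by` per sample" rule can be checked before data is loaded. The following is a minimal sketch, assuming nodes and edges are held in a plain Python dict keyed by record name and using the `otype`, `s`, `p`, and `o` fields from the YAML examples. The function name `check_produced_by` and the in-memory layout are illustrative assumptions, not part of any iSamples tooling.

```python
from collections import Counter


def check_produced_by(records):
    """Return {sample_key: message} for samples without exactly one produced_by target."""
    counts = Counter()
    for rec in records.values():
        if rec.get("p") == "produced_by":
            # Every entry in `o` counts as one produced_by target for subject `s`.
            counts[rec["s"]] += len(rec["o"])

    problems = {}
    for key, rec in records.items():
        if rec.get("otype") != "MaterialSampleRecord":
            continue
        n = counts[key]
        if n == 0:
            problems[key] = "missing produced_by"
        elif n > 1:
            problems[key] = f"{n} produced_by targets (expected 1)"
    return problems


# Hypothetical records: sample_002 has no produced_by edge.
example = {
    "sample_001": {"otype": "MaterialSampleRecord", "pid": "igsn:SSH000001"},
    "sample_002": {"otype": "MaterialSampleRecord", "pid": "igsn:SSH000002"},
    "event_001": {"otype": "SamplingEvent", "pid": "event:2023-catal-t5-001"},
    "edge_001": {"otype": "_edge_", "s": "sample_001", "p": "produced_by", "o": ["event_001"]},
}
print(check_produced_by(example))
```

Counting targets rather than edge records means a single edge with two entries in `o` is also reported as a cardinality violation.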
+❌ **Multiple produced_by:** +``` +Warning: Sample has multiple produced_by edges (should be one) +``` +**Solution:** Cardinality is ONE - use `related_resource` to link derived samples. + +--- + +### has_material_category + +**Type:** MaterialSampleRecord → IdentifiedConcept +**Cardinality:** Many (required, minimum 1) +**Required:** ✅ Yes + +#### Purpose +Classifies the physical material composition of the sample. Uses controlled vocabulary from iSamples Material Type Vocabulary. + +#### Controlled Vocabulary +[iSamples Material Type Vocabulary](https://w3id.org/isample/vocabulary/material/) + +**Top-level categories:** +- Rock +- Mineral +- Organic material +- Liquid water +- Anthropogenic material (includes pottery, glass, metals) +- Biogenic non-organic material +- Natural solid material +- Soil +- Particulate +- Fluid (non-water) + +**Example subcategories:** +- Rock → Igneous rock → Basalt +- Anthropogenic material → Pottery → Earthenware +- Organic material → Tissue → Bone + +#### Usage Example + +**YAML:** +```yaml +# Sample node +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + label: "Ceramic bowl" + +# Concept nodes (from controlled vocabulary) +concept_earthenware: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/material/0.9/earthenware" + label: "Earthenware" + scheme_name: "iSamples Material Type Vocabulary" + +concept_anthropogenic: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/material/0.9/anthropogenicmaterial" + label: "Anthropogenic material" + +# Edges (multivalued - can have multiple material types) +edge_001: + s: sample_001 + p: has_material_category + o: [concept_earthenware] + +edge_002: + s: sample_001 + p: has_material_category + o: [concept_anthropogenic] +``` + +#### SQL Query Pattern + +**Find all samples of a specific material type:** +```sql +SELECT + sample.pid, + sample.label, + concept.label AS material_type +FROM pqg AS sample +JOIN pqg AS edge 
+ ON edge.s = sample.row_id + AND edge.p = 'has_material_category' +JOIN pqg AS concept + ON concept.row_id = ANY(edge.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND concept.label ILIKE '%earthenware%'; +``` + +**Count samples by material type:** +```sql +SELECT + concept.label AS material_type, + COUNT(DISTINCT sample.pid) AS sample_count +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'has_material_category' +JOIN pqg AS concept ON concept.row_id = ANY(edge.o) +WHERE sample.otype = 'MaterialSampleRecord' +GROUP BY concept.label +ORDER BY sample_count DESC; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,352 relationships +- **Unique subjects:** 1,096,352 samples (one per sample) +- **Unique objects:** 10 material type concepts + +**Top material types in OpenContext:** +1. Anthropogenic material (pottery, artifacts) +2. Rock (stone tools, building materials) +3. Organic material (bone, charcoal) +4. Soil (sediment samples) + +#### Common Issues + +❌ **Using free text instead of controlled vocabulary:** +```yaml +# Wrong: +has_material_category: "pottery" + +# Right: +has_material_category: + - pid: "https://w3id.org/isample/vocabulary/material/0.9/earthenware" + label: "Earthenware" +``` + +--- + +### has_context_category + +**Type:** MaterialSampleRecord → IdentifiedConcept +**Cardinality:** Many (required, minimum 1) +**Required:** ✅ Yes + +#### Purpose +Classifies the broad context or sampled feature type. Indicates the domain (archaeology, marine biology, geology, etc.) and environment. 
+ +#### Controlled Vocabulary +[iSamples Sampled Feature Vocabulary](https://w3id.org/isample/vocabulary/sampledfeature/) + +**Top-level categories:** +- Terrestrial environment + - Archaeological + - Subsurface + - Surface +- Marine environment + - Marine biome + - Marine water body + - Submerged terrestrial +- Atmosphere +- Extraterrestrial +- Laboratory or production environment + +#### Usage Example + +**YAML:** +```yaml +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + +concept_archaeological: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/sampledfeature/0.9/terrestrial_archaeological" + label: "Terrestrial environment > Archaeological site" + scheme_name: "iSamples Sampled Feature Vocabulary" + +edge: + s: sample_001 + p: has_context_category + o: [concept_archaeological] +``` + +#### SQL Query Pattern + +**Find all archaeological samples:** +```sql +SELECT + sample.pid, + sample.label, + concept.label AS context +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'has_context_category' +JOIN pqg AS concept ON concept.row_id = ANY(edge.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND concept.label ILIKE '%archaeological%'; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,352 relationships +- **Unique subjects:** 1,096,352 samples +- **Unique objects:** 2 context concepts (OpenContext is archaeology-focused) + +--- + +### has_sample_object_type + +**Type:** MaterialSampleRecord → IdentifiedConcept +**Cardinality:** Many (required, minimum 1) +**Required:** ✅ Yes + +#### Purpose +Describes the physical form or object type of the sample. Answers "What kind of object is this?" 
+ +#### Controlled Vocabulary +[iSamples Material Sample Object Type Vocabulary](https://w3id.org/isample/vocabulary/materialsampleobjecttype/) + +**Common types:** +- Core +- Hand specimen +- Thin section +- Powder +- Cube +- Sherd (pottery fragment) +- Specimen (biological) +- Aggregate (multiple pieces) +- Other solid object + +#### Usage Example + +**YAML:** +```yaml +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + label: "Pottery fragment" + +concept_sherd: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/materialsampleobjecttype/0.9/sherd" + label: "Sherd" + scheme_name: "iSamples Material Sample Object Type Vocabulary" + +edge: + s: sample_001 + p: has_sample_object_type + o: [concept_sherd] +``` + +#### SQL Query Pattern + +**Find all core samples:** +```sql +SELECT + sample.pid, + sample.label, + concept.label AS object_type +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'has_sample_object_type' +JOIN pqg AS concept ON concept.row_id = ANY(edge.o) +WHERE concept.label = 'Core'; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,352 relationships +- **Unique subjects:** 1,096,352 samples +- **Unique objects:** 5 object type concepts + +--- + +### keywords + +**Type:** MaterialSampleRecord → IdentifiedConcept +**Cardinality:** Many (optional) +**Required:** ⚪ No + +#### Purpose +Free-text keywords for discovery and search. Can include taxonomic names, geographic terms, cultural periods, etc. + +#### Controlled Vocabulary +Not strictly controlled - can use various vocabularies or free text wrapped in IdentifiedConcept. 
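"Free text wrapped in IdentifiedConcept" can be sketched as a small helper that turns a list of terms into concept nodes plus one `keywords` edge per term. This is an illustration only: the function name `keyword_nodes`, the ASCII-folding slug rule, and the `keyword:` pid prefix applied to every term are assumptions modelled on the examples in this guide, not an iSamples requirement.

```python
import re
import unicodedata


def keyword_nodes(sample_key, terms):
    """Build IdentifiedConcept nodes and `keywords` edges for free-text terms."""
    nodes, edges = {}, []
    for term in terms:
        # Fold accents to ASCII so "Çatalhöyük" becomes "Catalhoyuk" (assumed slug rule).
        folded = unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode("ascii")
        slug = re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")
        node_key = "keyword_" + slug.replace("-", "_")
        # The original spelling is kept as the human-readable label.
        nodes[node_key] = {"otype": "IdentifiedConcept", "pid": f"keyword:{slug}", "label": term}
        edges.append({"otype": "_edge_", "s": sample_key, "p": "keywords", "o": [node_key]})
    return nodes, edges


nodes, edges = keyword_nodes("sample_001", ["Neolithic", "Bronze Age", "Çatalhöyük"])
for key, node in nodes.items():
    print(key, node["pid"], node["label"])
```

Keeping the label untouched while normalising only the pid lets searches match the original spelling and still gives each keyword a stable identifier.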
+ +#### Usage Example + +**YAML:** +```yaml +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + +keyword_neolithic: + otype: IdentifiedConcept + pid: "keyword:neolithic" + label: "Neolithic" + +keyword_pottery: + otype: IdentifiedConcept + pid: "keyword:pottery" + label: "Pottery" + +keyword_catalhoyuk: + otype: IdentifiedConcept + pid: "keyword:catalhoyuk" + label: "Çatalhöyük" + +# Multiple edges for multiple keywords +edge_001: + s: sample_001 + p: keywords + o: [keyword_neolithic] + +edge_002: + s: sample_001 + p: keywords + o: [keyword_pottery] + +edge_003: + s: sample_001 + p: keywords + o: [keyword_catalhoyuk] +``` + +#### SQL Query Pattern + +**Find samples with specific keyword:** +```sql +SELECT + sample.pid, + sample.label, + concept.label AS keyword +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'keywords' +JOIN pqg AS concept ON concept.row_id = ANY(edge.o) +WHERE concept.label ILIKE '%neolithic%'; +``` + +**Find samples with multiple keywords (AND logic):** +```sql +WITH sample_keywords AS ( + SELECT + sample.pid, + sample.label, + concept.label AS keyword + FROM pqg AS sample + JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'keywords' + JOIN pqg AS concept ON concept.row_id = ANY(edge.o) +) +SELECT pid, label +FROM sample_keywords +WHERE keyword IN ('Neolithic', 'Pottery') +GROUP BY pid, label +HAVING COUNT(DISTINCT keyword) = 2; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,297 relationships (not all samples have keywords) +- **Unique subjects:** 1,096,297 samples +- **Unique objects:** 4,033 unique keyword concepts + +--- + +### registrant + +**Type:** MaterialSampleRecord → Agent +**Cardinality:** One (optional) +**Required:** ⚪ No + +#### Purpose +Identifies the person or organization that registered the sample metadata. 
+ +#### Usage Example + +**YAML:** +```yaml +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + +agent_curator: + otype: Agent + pid: "https://orcid.org/0000-0002-1234-5678" + name: "Jane Smith" + affiliation: "OpenContext" + role: "Data Curator" + +edge: + s: sample_001 + p: registrant + o: [agent_curator] +``` + +#### SQL Query Pattern + +**Find all samples registered by a specific person:** +```sql +SELECT + sample.pid, + sample.label, + agent.name AS registrant_name +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'registrant' +JOIN pqg AS agent ON agent.row_id = ANY(edge.o) +WHERE agent.name ILIKE '%smith%'; +``` + +#### OpenContext Data Stats +- **Frequency:** 413,635 relationships (38% of samples) +- **Unique subjects:** 413,635 samples +- **Unique objects:** 340 agents + +--- + +### curation + +**Type:** MaterialSampleRecord → MaterialSampleCuration +**Cardinality:** One (optional) +**Required:** ⚪ No + +#### Purpose +Links sample to its curation information (storage location, access constraints, curation history). 
+ +#### Usage Example + +**YAML:** +```yaml +sample_001: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + +curation_001: + otype: MaterialSampleCuration + pid: "curation:smithsonian-nmnh-001" + label: "Smithsonian NMNH Anthropology Collection" + curation_location: "National Museum of Natural History, Washington DC" + access_constraints: "Appointment required" + +edge: + s: sample_001 + p: curation + o: [curation_001] +``` + +#### SQL Query Pattern + +**Find samples stored at specific location:** +```sql +SELECT + sample.pid, + sample.label, + curation.label AS collection_name, + curation.curation_location +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'curation' +JOIN pqg AS curation ON curation.row_id = ANY(edge.o) +WHERE curation.curation_location ILIKE '%smithsonian%'; +``` + +#### OpenContext Data Stats +- **Frequency:** 0 (OpenContext does not track curation information) + +--- + +### related_resource + +**Type:** MaterialSampleRecord → SampleRelation +**Cardinality:** Many (optional) +**Required:** ⚪ No + +#### Purpose +Links sample to other samples via defined relationships (parent-child, sibling, etc.). 
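Because each derived sample points at its parent through a SampleRelation, a parent-child chain can be walked one hop at a time. The sketch below assumes the same in-memory dict layout as the YAML in this guide (records keyed by name; `relationship` and `target` on the relation node, with `target` holding the parent's pid). The function name `ancestors` and the sample pids are hypothetical.

```python
def ancestors(records, sample_key):
    """Follow derivedFrom relations upward and return the parent pids, nearest first."""
    key_by_pid = {rec["pid"]: key for key, rec in records.items() if "pid" in rec}
    chain, seen, current = [], {sample_key}, sample_key
    while True:
        parent_key = None
        for rec in records.values():
            if rec.get("p") == "related_resource" and rec.get("s") == current:
                for relation_key in rec["o"]:
                    relation = records[relation_key]
                    if relation.get("relationship") == "derivedFrom":
                        # If several derivedFrom relations exist, the last one found wins.
                        parent_key = key_by_pid.get(relation["target"])
        # Stop at the root, at an unresolved target, or if a cycle is detected.
        if parent_key is None or parent_key in seen:
            return chain
        chain.append(records[parent_key]["pid"])
        seen.add(parent_key)
        current = parent_key


# Hypothetical tissue -> DNA -> library chain.
chain_example = {
    "tissue": {"otype": "MaterialSampleRecord", "pid": "igsn:TISSUE001"},
    "dna": {"otype": "MaterialSampleRecord", "pid": "igsn:DNA001"},
    "library": {"otype": "MaterialSampleRecord", "pid": "igsn:LIB001"},
    "rel_dna": {"otype": "SampleRelation", "pid": "relation:dna",
                "relationship": "derivedFrom", "target": "igsn:TISSUE001"},
    "rel_lib": {"otype": "SampleRelation", "pid": "relation:lib",
                "relationship": "derivedFrom", "target": "igsn:DNA001"},
    "edge_dna": {"otype": "_edge_", "s": "dna", "p": "related_resource", "o": ["rel_dna"]},
    "edge_lib": {"otype": "_edge_", "s": "library", "p": "related_resource", "o": ["rel_lib"]},
}
print(ancestors(chain_example, "library"))
```

The `seen` set guards against accidental cycles in the relation data, which would otherwise make the loop run forever.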
+ +#### Usage Example + +**YAML:** +```yaml +# Parent sample +parent_sample: + otype: MaterialSampleRecord + pid: "igsn:SSH000001" + label: "Whole rock core" + +# Child sample +child_sample: + otype: MaterialSampleRecord + pid: "igsn:SSH000002" + label: "Thin section from core" + +# Relation describing the connection +relation_001: + otype: SampleRelation + pid: "relation:subsample-001" + label: "Thin section derived from core" + relationship: "derivedFrom" + target: "igsn:SSH000001" # Points to parent + +# Edge from child to relation +edge: + s: child_sample + p: related_resource + o: [relation_001] +``` + +#### SQL Query Pattern + +**Find all child samples of a parent:** +```sql +SELECT + child.pid AS child_pid, + child.label AS child_label, + relation.relationship AS relation_type, + parent.pid AS parent_pid, + parent.label AS parent_label +FROM pqg AS child +JOIN pqg AS edge ON edge.s = child.row_id AND edge.p = 'related_resource' +JOIN pqg AS relation ON relation.row_id = ANY(edge.o) +JOIN pqg AS parent ON parent.pid = relation.target +WHERE parent.pid = 'igsn:SSH000001'; +``` + +#### OpenContext Data Stats +- **Frequency:** 0 (OpenContext does not track sample relationships) + +--- + +## SamplingEvent Predicates + +### sampling_site + +**Type:** SamplingEvent → SamplingSite +**Cardinality:** One (optional) +**Required:** ⚪ No + +#### Purpose +Links sampling event to a named sampling site. 
+ +#### Usage Example + +**YAML:** +```yaml +event_001: + otype: SamplingEvent + pid: "event:2023-catal-001" + +site_001: + otype: SamplingSite + pid: "site:catalhoyuk-south" + label: "Çatalhöyük South Area" + place_name: ["Çatalhöyük", "Çatal Höyük"] + +edge: + s: event_001 + p: sampling_site + o: [site_001] +``` + +#### SQL Query Pattern + +**Find all samples from a specific site:** +```sql +SELECT + sample.pid, + sample.label, + site.label AS site_name +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sampling_site' +JOIN pqg AS site ON site.row_id = ANY(edge2.o) +WHERE site.label ILIKE '%çatalhöyük%'; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,352 relationships +- **Unique subjects:** 1,096,352 events +- **Unique objects:** 18,213 sites + +--- + +### sample_location + +**Type:** SamplingEvent → GeospatialCoordLocation +**Cardinality:** One (optional) +**Required:** ⚪ No + +#### Purpose +Precise geographic coordinates where sample was collected.
+ +#### Usage Example + +**YAML:** +```yaml +event_001: + otype: SamplingEvent + pid: "event:2023-catal-001" + +coords_001: + otype: GeospatialCoordLocation + pid: "coords:37.6665-32.8274" + latitude: 37.6665 + longitude: 32.8274 + elevation: "1015 m above mean sea level" + +edge: + s: event_001 + p: sample_location + o: [coords_001] +``` + +#### SQL Query Pattern + +**Find all samples with coordinates (most important query!):** +```sql +SELECT + sample.pid, + sample.label, + coords.latitude, + coords.longitude, + coords.elevation +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND coords.latitude IS NOT NULL; +``` + +**Find samples within bounding box:** +```sql +SELECT + sample.pid, + coords.latitude, + coords.longitude +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +WHERE coords.latitude BETWEEN 37.0 AND 38.0 + AND coords.longitude BETWEEN 32.0 AND 33.0; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,096,274 relationships (99.99% of events) +- **Unique subjects:** 1,096,274 events +- **Unique objects:** 190,566 coordinate pairs + +--- + +### responsibility (SamplingEvent) + +**Type:** SamplingEvent → Agent +**Cardinality:** Many (optional) +**Required:** ⚪ No + +#### Purpose +Identifies person(s) responsible for sample collection at the event. + +#### Usage Example + +**YAML:** +```yaml +event_001: + otype: SamplingEvent + pid: "event:2023-catal-001" + +agent_001: + otype: Agent + pid: "https://orcid.org/0000-0002-1234-5678" + name: "Dr. 
Jane Smith" + role: "Principal Investigator" + +agent_002: + otype: Agent + pid: "https://orcid.org/0000-0002-5678-1234" + name: "John Doe" + role: "Field Technician" + +edge_001: + s: event_001 + p: responsibility + o: [agent_001] + +edge_002: + s: event_001 + p: responsibility + o: [agent_002] +``` + +#### SQL Query Pattern + +**Find all samples collected by specific person:** +```sql +SELECT + sample.pid, + sample.label, + agent.name AS collector +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'responsibility' +JOIN pqg AS agent ON agent.row_id = ANY(edge2.o) +WHERE agent.name ILIKE '%smith%'; +``` + +#### OpenContext Data Stats +- **Frequency:** 1,095,272 relationships +- **Unique subjects:** 1,095,272 events +- **Unique objects:** 197 agents + +--- + +### has_context_category (SamplingEvent) + +**Type:** SamplingEvent → IdentifiedConcept +**Cardinality:** Many (optional) +**Required:** ⚪ No + +#### Purpose +Context classification at the event level (separate from sample-level context). + +#### Usage Example + +**YAML:** +```yaml +event_001: + otype: SamplingEvent + pid: "event:marine-expedition-001" + +concept_marine: + otype: IdentifiedConcept + pid: "https://w3id.org/isample/vocabulary/sampledfeature/0.9/marinebiome" + label: "Marine biome" + +edge: + s: event_001 + p: has_context_category + o: [concept_marine] +``` + +#### OpenContext Data Stats +- **Frequency:** 0 (OpenContext does not use event-level context) + +--- + +## SamplingSite Predicates + +### site_location + +**Type:** SamplingSite → GeospatialCoordLocation +**Cardinality:** One (optional) +**Required:** ⚪ No + +#### Purpose +Geographic coordinates for the sampling site (typically less precise than sample_location).
+ +#### Usage Example + +**YAML:** +```yaml +site_001: + otype: SamplingSite + pid: "site:catalhoyuk" + label: "Çatalhöyük" + +coords_site: + otype: GeospatialCoordLocation + pid: "coords:site-catalhoyuk" + latitude: 37.666 + longitude: 32.827 + elevation: "1000 m above mean sea level" + +edge: + s: site_001 + p: site_location + o: [coords_site] +``` + +#### SQL Query Pattern + +**Find all sites with coordinates:** +```sql +SELECT + site.pid, + site.label AS site_name, + coords.latitude, + coords.longitude +FROM pqg AS site +JOIN pqg AS edge ON edge.s = site.row_id AND edge.p = 'site_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge.o) +WHERE site.otype = 'SamplingSite'; +``` + +#### OpenContext Data Stats +- **Frequency:** 18,213 relationships +- **Unique subjects:** 18,213 sites +- **Unique objects:** 18,213 coordinate pairs (1:1) + +--- + +## MaterialSampleCuration Predicates + +### responsibility (MaterialSampleCuration) + +**Type:** MaterialSampleCuration → Agent +**Cardinality:** Many (optional) +**Required:** ⚪ No + +#### Purpose +Identifies person(s) responsible for sample curation. + +#### Usage Example + +**YAML:** +```yaml +curation_001: + otype: MaterialSampleCuration + pid: "curation:smithsonian-001" + +agent_curator: + otype: Agent + pid: "curator:jsmith" + name: "Jane Smith" + role: "Collection Manager" + +edge: + s: curation_001 + p: responsibility + o: [agent_curator] +``` + +#### OpenContext Data Stats +- **Frequency:** 0 (OpenContext does not track curation) + +--- + +## Cross-Reference: Predicate Usage by Domain + +### OpenContext (Archaeology) - Uses 10 of 14 + +✅ **Used:** +1. produced_by +2. has_material_category +3. has_context_category +4. has_sample_object_type +5. keywords +6. registrant +7. sampling_site +8. sample_location +9. responsibility (SamplingEvent) +10. 
site_location + +❌ **Not used:** +- curation +- related_resource +- has_context_category (SamplingEvent) +- responsibility (MaterialSampleCuration) + +### Expected SESAR (Geology) - Projected 8-10 of 14 + +✅ **Likely used:** +1. produced_by +2. has_material_category +3. has_context_category +4. has_sample_object_type +5. sample_location +6. curation +7. responsibility (SamplingEvent) +8. responsibility (MaterialSampleCuration) + +### Expected GEOME (Biology) - Projected 9-11 of 14 + +✅ **Likely used:** +1. produced_by +2. has_material_category +3. has_context_category +4. has_sample_object_type +5. keywords +6. related_resource (parent-child samples) +7. sampling_site +8. sample_location +9. responsibility (SamplingEvent) + +--- + +## Summary + +**Key Takeaways:** + +1. **4 predicates are required** - produced_by, has_material_category, has_context_category, has_sample_object_type +2. **2 predicates involve coordinates** - sample_location, site_location (nested sites are linked to each other separately via is_part_of) +3. **3 predicates involve agents** - registrant, responsibility (SamplingEvent), responsibility (MaterialSampleCuration) +4. **2 predicates share names** - responsibility, has_context_category (different subjects) +5. 
**Different domains use different subsets** - Same schema, different instantiations + +**Next steps:** +- [EXAMPLES_BY_DOMAIN.md](./EXAMPLES_BY_DOMAIN.md) - See these predicates in real-world examples +- [QUERYING_THE_GRAPH.md](./QUERYING_THE_GRAPH.md) - More complex query patterns + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-14 +**Schema Version:** 20250207 (MaterialSampleRecord) +**Author:** Claude Code (Sonnet 4.5) diff --git a/src/docs/QUERYING_THE_GRAPH.md b/src/docs/QUERYING_THE_GRAPH.md new file mode 100644 index 00000000..ae3a2273 --- /dev/null +++ b/src/docs/QUERYING_THE_GRAPH.md @@ -0,0 +1,774 @@ +# Querying the Property Graph: Practical SQL Patterns + +This guide provides practical SQL query patterns for working with iSamples property graph data. All queries are designed for DuckDB, but the patterns work with other SQL databases. + +## Table of Contents + +1. [Understanding the Storage Model](#understanding-the-storage-model) +2. [Basic Entity Queries](#basic-entity-queries) +3. [Single-Hop Traversals](#single-hop-traversals) +4. [Multi-Hop Traversals](#multi-hop-traversals) +5. [Aggregation and Statistics](#aggregation-and-statistics) +6. [Filtering and Search](#filtering-and-search) +7. [Complex Query Patterns](#complex-query-patterns) +8. [Performance Optimization](#performance-optimization) +9. [Common Query Recipes](#common-query-recipes) + +--- + +## Understanding the Storage Model + +### The Unified Table Structure + +All nodes and edges are stored in a single table with these key columns: + +```sql +CREATE TABLE pqg ( + row_id INTEGER PRIMARY KEY, -- Internal identifier + pid TEXT, -- Persistent identifier (for entities) + otype TEXT, -- Object type: entity type OR '_edge_' + s INTEGER, -- Subject row_id (for edges) + p TEXT, -- Predicate (for edges) + o INTEGER[], -- Array of object row_ids (for edges) + n TEXT, -- Node value (for simple nodes) + -- Plus entity-specific columns (label, description, etc.) 
+); +``` + +### Key Concepts + +1. **Entities** have `otype` ∈ {MaterialSampleRecord, SamplingEvent, ...} +2. **Edges** have `otype = '_edge_'` +3. **Relationships** require joining through edge rows +4. **Multi-valued predicates** store multiple objects in the `o` array + +--- + +## Basic Entity Queries + +### Count Entities by Type + +```sql +-- Get counts of each entity type +SELECT + otype, + COUNT(*) as count +FROM pqg +WHERE otype != '_edge_' +GROUP BY otype +ORDER BY count DESC; +``` + +**Example output (OpenContext data):** +``` +otype count +IdentifiedConcept 8,270,644 +MaterialSampleRecord 1,096,352 +GeospatialCoordLocation 1,095,912 +SamplingEvent 1,095,912 +SamplingSite 383,912 +Agent 72,520 +``` + +### Find Entity by PID + +```sql +-- Look up a specific sample +SELECT * +FROM pqg +WHERE pid = 'iSamples:OPENCONTEXT:1b22b93b-...' + AND otype = 'MaterialSampleRecord'; +``` + +### List All Samples with Basic Info + +```sql +-- Get up to 1000 samples with labels +SELECT + pid, + label, + description +FROM pqg +WHERE otype = 'MaterialSampleRecord' +LIMIT 1000; +``` + +--- + +## Single-Hop Traversals + +### Pattern: Entity → Edge → Entity + +For a single relationship, you need: +1. Start entity (subject) +2. Edge row connecting them +3. Target entity (object) + +### Example: Find Sampling Events for Samples + +```sql +-- Which sampling event produced this sample? 
+SELECT + sample.pid AS sample_id, + sample.label AS sample_label, + event.pid AS event_id, + event.label AS event_label +FROM pqg AS sample +-- Join to edge +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'produced_by' + AND edge.otype = '_edge_' +-- Join to target entity +JOIN pqg AS event + ON event.row_id = ANY(edge.o) + AND event.otype = 'SamplingEvent' +WHERE sample.otype = 'MaterialSampleRecord' +LIMIT 100; +``` + +**Key pattern:** +- `edge.s = sample.row_id` - Edge starts at sample +- `edge.p = 'produced_by'` - Predicate identifies relationship type +- `event.row_id = ANY(edge.o)` - Handle multi-valued predicates +- Always filter by `otype` for performance + +### Example: Find Material Categories for Sample + +```sql +-- What material types does this sample have? +SELECT + sample.pid AS sample_id, + sample.label AS sample_label, + material.pid AS material_category_id, + material.label AS material_category +FROM pqg AS sample +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'has_material_category' +JOIN pqg AS material + ON material.row_id = ANY(edge.o) + AND material.otype = 'IdentifiedConcept' +WHERE sample.pid = 'iSamples:OPENCONTEXT:...' + AND sample.otype = 'MaterialSampleRecord'; +``` + +**Note:** `has_material_category` is **multivalued**, so a sample may have multiple material types. The `ANY(edge.o)` handles this array. 
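The single-hop pattern can also be sketched outside SQL. The snippet below is a minimal, hypothetical in-memory model of a few `pqg` rows; the row values, the "Organic material" label, and the `objects_of` helper are illustrative assumptions, not part of any iSamples library. It shows how one edge row with an array-valued `o` yields several objects, mirroring `ANY(edge.o)`.

```python
# Hypothetical in-memory stand-in for a few pqg rows (illustrative values only).
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "igsn:SSH000001"},
    {"row_id": 2, "otype": "IdentifiedConcept", "label": "Earthenware"},
    {"row_id": 3, "otype": "IdentifiedConcept", "label": "Organic material"},
    # One edge row; its o array holds two object row_ids.
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "has_material_category", "o": [2, 3]},
]

# Lookup table from row_id to row, standing in for the primary-key index.
by_id = {row["row_id"]: row for row in rows}


def objects_of(subject_row_id, predicate):
    """Return the object rows reachable from a subject via one predicate."""
    found = []
    for row in rows:
        is_match = (
            row["otype"] == "_edge_"
            and row["s"] == subject_row_id
            and row["p"] == predicate
        )
        if is_match:
            # Rough equivalent of `material.row_id = ANY(edge.o)` in the SQL.
            found.extend(by_id[oid] for oid in row["o"])
    return found


labels = [m["label"] for m in objects_of(1, "has_material_category")]
print(labels)
```

A predicate with no matching edge row simply returns an empty list, which corresponds to the inner `JOIN` dropping the sample from the SQL result.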
+ +--- + +## Multi-Hop Traversals + +### Pattern: Chaining Relationships + +Many useful queries require following multiple edges: + +``` +MaterialSampleRecord + --produced_by--> SamplingEvent + --sample_location--> GeospatialCoordLocation +``` + +### Example: Samples with Coordinates (2-hop) + +```sql +-- Find all samples with geographic coordinates +SELECT + sample.pid AS sample_id, + sample.label AS sample_label, + coords.latitude, + coords.longitude, + coords.elevation +FROM pqg AS sample +-- First hop: sample → event +JOIN pqg AS edge1 + ON edge1.s = sample.row_id + AND edge1.p = 'produced_by' +JOIN pqg AS event + ON event.row_id = ANY(edge1.o) + AND event.otype = 'SamplingEvent' +-- Second hop: event → coordinates +JOIN pqg AS edge2 + ON edge2.s = event.row_id + AND edge2.p = 'sample_location' +JOIN pqg AS coords + ON coords.row_id = ANY(edge2.o) + AND coords.otype = 'GeospatialCoordLocation' +WHERE sample.otype = 'MaterialSampleRecord' + AND coords.latitude IS NOT NULL + AND coords.longitude IS NOT NULL +LIMIT 1000; +``` + +**Performance note:** This is the most common query pattern in iSamples - optimize it with indexes on `row_id`, `s`, `p`, `otype`.
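To make the chaining explicit, here is a small Python sketch of the same two-hop traversal over a hypothetical in-memory copy of a few `pqg` rows. The rows and the `follow` helper are assumptions made for illustration; the `(s, p)` dictionary plays roughly the role that indexes on `s` and `p` play in the database.

```python
# Hypothetical in-memory stand-in for pqg rows (illustrative values only).
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "igsn:SSH000001"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event:2023-catal-001"},
    {"row_id": 3, "otype": "GeospatialCoordLocation",
     "latitude": 37.6665, "longitude": 32.8274},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 2, "p": "sample_location", "o": [3]},
]

by_id = {row["row_id"]: row for row in rows}

# Group edge objects by (subject, predicate) so each hop is a dictionary lookup.
edges = {}
for row in rows:
    if row["otype"] == "_edge_":
        edges.setdefault((row["s"], row["p"]), []).extend(row["o"])


def follow(start_ids, predicates):
    """Follow a chain of predicates from the given row_ids; return final row_ids."""
    current = list(start_ids)
    for predicate in predicates:
        current = [
            oid
            for sid in current
            for oid in edges.get((sid, predicate), [])
        ]
    return current


coord_ids = follow([1], ["produced_by", "sample_location"])
coords = [(by_id[i]["latitude"], by_id[i]["longitude"]) for i in coord_ids]
print(coords)
```

Each additional hop is one more predicate in the list, just as each additional hop in SQL is one more edge join plus one more entity join.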
+ +### Example: Samples → Site Name (3-hop) + +```sql +-- Get sampling site names for samples +SELECT + sample.pid AS sample_id, + sample.label AS sample_label, + site.pid AS site_id, + site.label AS site_name, + site_coords.latitude AS site_lat, + site_coords.longitude AS site_lon +FROM pqg AS sample +-- Hop 1: sample → event +JOIN pqg AS edge1 + ON edge1.s = sample.row_id + AND edge1.p = 'produced_by' +JOIN pqg AS event + ON event.row_id = ANY(edge1.o) + AND event.otype = 'SamplingEvent' +-- Hop 2: event → site +JOIN pqg AS edge2 + ON edge2.s = event.row_id + AND edge2.p = 'sampling_site' +JOIN pqg AS site + ON site.row_id = ANY(edge2.o) + AND site.otype = 'SamplingSite' +-- Hop 3: site → coordinates +JOIN pqg AS edge3 + ON edge3.s = site.row_id + AND edge3.p = 'site_location' +JOIN pqg AS site_coords + ON site_coords.row_id = ANY(edge3.o) + AND site_coords.otype = 'GeospatialCoordLocation' +WHERE sample.otype = 'MaterialSampleRecord' +LIMIT 1000; +``` + +**Design note:** `sampling_site` is optional in the iSamples schema, so use `LEFT JOIN` for hops 2 and 3 if you also want samples that have no site. + +--- + +## Aggregation and Statistics + +### Count Edge Types in Dataset + +```sql +-- Which relationship types are actually used? +SELECT + otype AS row_type, + p AS predicate, + COUNT(*) AS edge_count +FROM pqg +WHERE otype = '_edge_' +GROUP BY otype, p +ORDER BY edge_count DESC; +``` + +**Example output (OpenContext):** +``` +row_type predicate edge_count +_edge_ has_sample_object_type 1,124,480 +_edge_ produced_by 1,096,352 +_edge_ has_material_category 1,095,920 +_edge_ has_context_category 1,095,912 +_edge_ keywords 1,070,912 +``` + +### Samples per Material Category + +```sql +-- How many samples for each material type? 
+SELECT + material.label AS material_category, + COUNT(DISTINCT sample.pid) AS sample_count +FROM pqg AS sample +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'has_material_category' +JOIN pqg AS material + ON material.row_id = ANY(edge.o) + AND material.otype = 'IdentifiedConcept' +WHERE sample.otype = 'MaterialSampleRecord' +GROUP BY material.label +ORDER BY sample_count DESC +LIMIT 20; +``` + +### Geographic Bounding Box + +```sql +-- Find extent of all sample locations +SELECT + MIN(coords.latitude) AS min_lat, + MAX(coords.latitude) AS max_lat, + MIN(coords.longitude) AS min_lon, + MAX(coords.longitude) AS max_lon, + COUNT(DISTINCT sample.pid) AS sample_count +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND coords.latitude IS NOT NULL + AND coords.longitude IS NOT NULL; +``` + +--- + +## Filtering and Search + +### Filter by Material Category + +```sql +-- Find all pottery samples +SELECT + sample.pid, + sample.label, + material.label AS material_type +FROM pqg AS sample +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'has_material_category' +JOIN pqg AS material + ON material.row_id = ANY(edge.o) + AND material.otype = 'IdentifiedConcept' +WHERE sample.otype = 'MaterialSampleRecord' + AND material.label ILIKE '%pottery%' +LIMIT 1000; +``` + +**Note:** Use `ILIKE` for case-insensitive matching, `LIKE` for case-sensitive. 
+ +### Filter by Geographic Region + +```sql +-- Find samples in Turkey (approximate bounding box) +SELECT + sample.pid, + sample.label, + coords.latitude, + coords.longitude +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND coords.latitude BETWEEN 36.0 AND 42.0 + AND coords.longitude BETWEEN 26.0 AND 45.0 +LIMIT 1000; +``` + +### Filter by Keyword + +```sql +-- Find samples with specific keyword +SELECT + sample.pid, + sample.label, + keyword.label AS keyword +FROM pqg AS sample +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'keywords' +JOIN pqg AS keyword + ON keyword.row_id = ANY(edge.o) + AND keyword.otype = 'IdentifiedConcept' +WHERE sample.otype = 'MaterialSampleRecord' + AND keyword.label ILIKE '%neolithic%' +LIMIT 1000; +``` + +### Combine Multiple Filters + +```sql +-- Find pottery samples from Turkey with coordinates +SELECT + sample.pid, + sample.label, + material.label AS material, + coords.latitude, + coords.longitude +FROM pqg AS sample +-- Material category +JOIN pqg AS mat_edge + ON mat_edge.s = sample.row_id + AND mat_edge.p = 'has_material_category' +JOIN pqg AS material + ON material.row_id = ANY(mat_edge.o) + AND material.otype = 'IdentifiedConcept' +-- Coordinates +JOIN pqg AS event_edge + ON event_edge.s = sample.row_id + AND event_edge.p = 'produced_by' +JOIN pqg AS event + ON event.row_id = ANY(event_edge.o) +JOIN pqg AS coord_edge + ON coord_edge.s = event.row_id + AND coord_edge.p = 'sample_location' +JOIN pqg AS coords + ON coords.row_id = ANY(coord_edge.o) + AND coords.otype = 'GeospatialCoordLocation' +WHERE sample.otype = 'MaterialSampleRecord' + AND material.label ILIKE '%pottery%' + AND coords.latitude BETWEEN 36.0 AND 42.0 + AND coords.longitude 
BETWEEN 26.0 AND 45.0 +LIMIT 1000; +``` + +--- + +## Complex Query Patterns + +### Find Samples Missing Specific Relationships + +```sql +-- Samples without material category (quality check) +SELECT + sample.pid, + sample.label +FROM pqg AS sample +WHERE sample.otype = 'MaterialSampleRecord' + AND NOT EXISTS ( + SELECT 1 + FROM pqg AS edge + WHERE edge.s = sample.row_id + AND edge.p = 'has_material_category' + AND edge.otype = '_edge_' + ) +LIMIT 1000; +``` + +### Samples with Multiple Material Categories + +```sql +-- Samples categorized as multiple material types +SELECT + sample.pid, + sample.label, + ARRAY_AGG(material.label) AS material_categories, + COUNT(*) AS category_count +FROM pqg AS sample +JOIN pqg AS edge + ON edge.s = sample.row_id + AND edge.p = 'has_material_category' +JOIN pqg AS material + ON material.row_id = ANY(edge.o) + AND material.otype = 'IdentifiedConcept' +WHERE sample.otype = 'MaterialSampleRecord' +GROUP BY sample.pid, sample.label +HAVING COUNT(*) > 1 +ORDER BY category_count DESC +LIMIT 100; +``` + +### Hierarchical Queries (Parent-Child Samples) + +```sql +-- Find child samples and their parents +SELECT + child.pid AS child_id, + child.label AS child_label, + relation.relationship, + parent.pid AS parent_id, + parent.label AS parent_label +FROM pqg AS child +-- Child → SampleRelation edge +JOIN pqg AS edge1 + ON edge1.s = child.row_id + AND edge1.p = 'related_resource' +JOIN pqg AS relation + ON relation.row_id = ANY(edge1.o) + AND relation.otype = 'SampleRelation' +-- SampleRelation → Parent edge +JOIN pqg AS edge2 + ON edge2.s = relation.row_id + AND edge2.p = 'related_sample' -- Assuming this predicate exists +JOIN pqg AS parent + ON parent.row_id = ANY(edge2.o) + AND parent.otype = 'MaterialSampleRecord' +WHERE child.otype = 'MaterialSampleRecord' + AND relation.relationship = 'isPartOf' +LIMIT 1000; +``` + +### Spatial Proximity Search + +```sql +-- Find samples within ~10km of a point (approximate) +-- 1 degree 
latitude ≈ 111km, 1 degree longitude ≈ 111km * cos(latitude) +WITH target AS ( + SELECT 37.6665 AS target_lat, 32.8274 AS target_lon -- Çatalhöyük +) +SELECT + sample.pid, + sample.label, + coords.latitude, + coords.longitude, + -- Approximate distance in km + 111.0 * SQRT( + POWER(coords.latitude - target.target_lat, 2) + + POWER((coords.longitude - target.target_lon) * COS(RADIANS(target.target_lat)), 2) + ) AS distance_km +FROM pqg AS sample +CROSS JOIN target +JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND coords.latitude IS NOT NULL + AND coords.longitude IS NOT NULL + AND ABS(coords.latitude - target.target_lat) < 0.1 -- Pre-filter + AND ABS(coords.longitude - target.target_lon) < 0.1 +ORDER BY distance_km +LIMIT 100; +``` + +**Note:** For precise geospatial calculations, use PostGIS or the DuckDB spatial extension. + +--- + +## Performance Optimization + +### Use Indexes + +```sql +-- Create indexes for common join patterns +-- (DuckDB's CREATE INDEX has no WHERE clause, so these indexes cover all rows) +CREATE INDEX idx_row_id ON pqg(row_id); +CREATE INDEX idx_edge_s ON pqg(s); +CREATE INDEX idx_edge_p ON pqg(p); +CREATE INDEX idx_otype ON pqg(otype); +CREATE INDEX idx_pid ON pqg(pid); +``` + +### Filter Early + +```sql +-- ❌ BAD: Filter after all joins +SELECT sample.pid, coords.latitude +FROM pqg AS sample +JOIN pqg AS edge1 ON edge1.s = sample.row_id +JOIN pqg AS event ON event.row_id = ANY(edge1.o) +JOIN pqg AS edge2 ON edge2.s = event.row_id +JOIN pqg AS coords ON coords.row_id = ANY(edge2.o) +WHERE sample.otype = 'MaterialSampleRecord' -- Too late! 
+ AND coords.latitude > 40.0; + +-- ✅ GOOD: Filter in JOIN conditions +SELECT sample.pid, coords.latitude +FROM pqg AS sample +JOIN pqg AS edge1 + ON edge1.s = sample.row_id + AND edge1.p = 'produced_by' + AND edge1.otype = '_edge_' +JOIN pqg AS event + ON event.row_id = ANY(edge1.o) + AND event.otype = 'SamplingEvent' +JOIN pqg AS edge2 + ON edge2.s = event.row_id + AND edge2.p = 'sample_location' + AND edge2.otype = '_edge_' +JOIN pqg AS coords + ON coords.row_id = ANY(edge2.o) + AND coords.otype = 'GeospatialCoordLocation' + AND coords.latitude > 40.0 -- Filter here! +WHERE sample.otype = 'MaterialSampleRecord'; +``` + +### Use CTEs for Readability + +```sql +-- Break complex queries into steps +WITH samples_with_events AS ( + SELECT + sample.row_id AS sample_row_id, + sample.pid AS sample_pid, + sample.label AS sample_label, + event.row_id AS event_row_id + FROM pqg AS sample + JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'produced_by' + JOIN pqg AS event ON event.row_id = ANY(edge.o) + WHERE sample.otype = 'MaterialSampleRecord' + AND event.otype = 'SamplingEvent' +), +events_with_coords AS ( + SELECT + swe.sample_row_id, + swe.sample_pid, + swe.sample_label, + coords.latitude, + coords.longitude + FROM samples_with_events swe + JOIN pqg AS edge ON edge.s = swe.event_row_id AND edge.p = 'sample_location' + JOIN pqg AS coords ON coords.row_id = ANY(edge.o) + WHERE coords.otype = 'GeospatialCoordLocation' + AND coords.latitude IS NOT NULL +) +SELECT * FROM events_with_coords +LIMIT 1000; +``` + +### Limit Result Sets + +```sql +-- Always use LIMIT for exploratory queries +SELECT * FROM pqg LIMIT 1000; -- ✅ + +-- Dangerous without LIMIT on large datasets +SELECT * FROM pqg; -- ❌ Could return millions of rows +``` + +--- + +## Common Query Recipes + +### Recipe 1: Export Samples with Full Metadata + +```sql +-- Complete sample export with all key attributes +SELECT + sample.pid AS sample_id, + sample.label AS sample_name, + sample.description, + 
material.label AS material_type, + context.label AS context_category, + coords.latitude, + coords.longitude, + coords.elevation, + site.label AS site_name, + agent.label AS collector +FROM pqg AS sample +-- Material +LEFT JOIN pqg AS mat_edge ON mat_edge.s = sample.row_id AND mat_edge.p = 'has_material_category' +LEFT JOIN pqg AS material ON material.row_id = ANY(mat_edge.o) AND material.otype = 'IdentifiedConcept' +-- Context +LEFT JOIN pqg AS ctx_edge ON ctx_edge.s = sample.row_id AND ctx_edge.p = 'has_context_category' +LEFT JOIN pqg AS context ON context.row_id = ANY(ctx_edge.o) AND context.otype = 'IdentifiedConcept' +-- Event and location +LEFT JOIN pqg AS event_edge ON event_edge.s = sample.row_id AND event_edge.p = 'produced_by' +LEFT JOIN pqg AS event ON event.row_id = ANY(event_edge.o) AND event.otype = 'SamplingEvent' +LEFT JOIN pqg AS coord_edge ON coord_edge.s = event.row_id AND coord_edge.p = 'sample_location' +LEFT JOIN pqg AS coords ON coords.row_id = ANY(coord_edge.o) AND coords.otype = 'GeospatialCoordLocation' +-- Site +LEFT JOIN pqg AS site_edge ON site_edge.s = event.row_id AND site_edge.p = 'sampling_site' +LEFT JOIN pqg AS site ON site.row_id = ANY(site_edge.o) AND site.otype = 'SamplingSite' +-- Collector +LEFT JOIN pqg AS agent_edge ON agent_edge.s = event.row_id AND agent_edge.p = 'responsibility' +LEFT JOIN pqg AS agent ON agent.row_id = ANY(agent_edge.o) AND agent.otype = 'Agent' +WHERE sample.otype = 'MaterialSampleRecord' +LIMIT 10000; +``` + +### Recipe 2: Validate Data Quality + +```sql +-- Find samples with potential data issues +SELECT + 'No material category' AS issue, + COUNT(*) AS count +FROM pqg AS sample +WHERE sample.otype = 'MaterialSampleRecord' + AND NOT EXISTS ( + SELECT 1 FROM pqg AS edge + WHERE edge.s = sample.row_id AND edge.p = 'has_material_category' + ) +UNION ALL +SELECT + 'No sampling event', + COUNT(*) +FROM pqg AS sample +WHERE sample.otype = 'MaterialSampleRecord' + AND NOT EXISTS ( + SELECT 1 FROM pqg AS edge 
+ WHERE edge.s = sample.row_id AND edge.p = 'produced_by' + ) +UNION ALL +SELECT + 'No coordinates', + COUNT(*) +FROM pqg AS sample +WHERE sample.otype = 'MaterialSampleRecord' + AND NOT EXISTS ( + SELECT 1 FROM pqg AS e1 + JOIN pqg AS event ON event.row_id = ANY(e1.o) + JOIN pqg AS e2 ON e2.s = event.row_id AND e2.p = 'sample_location' + WHERE e1.s = sample.row_id AND e1.p = 'produced_by' + ); +``` + +### Recipe 3: Generate GeoJSON + +```sql +-- Create GeoJSON for web mapping +SELECT json_object( + 'type', 'FeatureCollection', + 'features', json_group_array( + json_object( + 'type', 'Feature', + 'geometry', json_object( + 'type', 'Point', + 'coordinates', json_array(coords.longitude, coords.latitude) + ), + 'properties', json_object( + 'id', sample.pid, + 'label', sample.label, + 'material', material.label + ) + ) + ) +) AS geojson +FROM pqg AS sample +JOIN pqg AS mat_edge ON mat_edge.s = sample.row_id AND mat_edge.p = 'has_material_category' +JOIN pqg AS material ON material.row_id = ANY(mat_edge.o) +JOIN pqg AS event_edge ON event_edge.s = sample.row_id AND event_edge.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(event_edge.o) +JOIN pqg AS coord_edge ON coord_edge.s = event.row_id AND coord_edge.p = 'sample_location' +JOIN pqg AS coords ON coords.row_id = ANY(coord_edge.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND coords.latitude IS NOT NULL + AND coords.longitude IS NOT NULL +LIMIT 1000; +``` + +### Recipe 4: Time Series Analysis + +```sql +-- Samples by collection year (if date data available) +SELECT + EXTRACT(YEAR FROM event.event_date) AS collection_year, + COUNT(DISTINCT sample.pid) AS sample_count +FROM pqg AS sample +JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'produced_by' +JOIN pqg AS event ON event.row_id = ANY(edge.o) +WHERE sample.otype = 'MaterialSampleRecord' + AND event.otype = 'SamplingEvent' + AND event.event_date IS NOT NULL +GROUP BY collection_year +ORDER BY collection_year; +``` + +--- + +## Next Steps + +- 
**Visual diagrams**: See [EDGE_TYPES_VISUAL.md](EDGE_TYPES_VISUAL.md) for entity relationship diagrams +- **Predicate reference**: See [PREDICATES_REFERENCE.md](PREDICATES_REFERENCE.md) for detailed predicate documentation +- **Graph structure**: See [UNDERSTANDING_THE_GRAPH.md](UNDERSTANDING_THE_GRAPH.md) for conceptual overview +- **Real examples**: See [EXAMPLES_BY_DOMAIN.md](EXAMPLES_BY_DOMAIN.md) for complete domain-specific examples + +--- + +## Tips and Best Practices + +1. **Always filter by `otype`** in JOIN conditions for performance +2. **Use `ANY(edge.o)`** to handle multi-valued predicates correctly +3. **Start simple** - test single-hop queries before chaining multiple relationships +4. **Use CTEs** to break down complex queries into understandable steps +5. **Add LIMIT** to all exploratory queries +6. **Check for NULL values** in coordinate and date fields +7. **Use LEFT JOIN** when relationships are optional +8. **Explain your queries** with `EXPLAIN` or `EXPLAIN ANALYZE` (DuckDB) to optimize performance +9. **Batch queries** when possible instead of running thousands individually +10. **Document your patterns** - complex graph traversals are hard to remember! + +--- + +**Last updated:** 2025-11-14 +**Part of:** iSamples Property Graph Documentation Suite diff --git a/src/docs/UNDERSTANDING_THE_GRAPH.md b/src/docs/UNDERSTANDING_THE_GRAPH.md new file mode 100644 index 00000000..075f5ca9 --- /dev/null +++ b/src/docs/UNDERSTANDING_THE_GRAPH.md @@ -0,0 +1,664 @@ +# Understanding the iSamples Property Graph + +**Purpose:** This guide helps you understand how iSamples metadata is structured as a property graph, making it easier to query, validate, and work with material sample data. + +**Key Insight:** The iSamples property graph has a well-defined **grammar** consisting of 8 entity types and 14 relationship types. Understanding this grammar is essential for working with PQG-formatted data. + +--- + +## Table of Contents + +1. 
[What is a Property Graph?](#what-is-a-property-graph) +2. [The 8 Entity Types (oTypes)](#the-8-entity-types-otypes) +3. [The 14 Relationship Types (Predicates)](#the-14-relationship-types-predicates) +4. [The 14 Sentence Types](#the-14-sentence-types) +5. [Why This Structure?](#why-this-structure) +6. [Graph Traversal Patterns](#graph-traversal-patterns) +7. [Storage Format](#storage-format) + +--- + +## What is a Property Graph? + +A **property graph** represents data as: +- **Nodes** (entities with properties) +- **Edges** (relationships between nodes) + +**Why use a graph?** +- Captures complex relationships naturally +- Enables flexible multi-hop queries +- Supports domain-agnostic modeling +- Facilitates data integration across sources + +**iSamples Graph Example:** +``` +MaterialSampleRecord (pottery sherd) + | + |-- produced_by --> SamplingEvent (2023 excavation) + | | + | |-- sampling_site --> SamplingSite (Çatalhöyük) + | | | + | | |-- site_location --> GeospatialCoordLocation + | | + | |-- sample_location --> GeospatialCoordLocation + | | + | |-- responsibility --> Agent (Dr. Smith) + | + |-- has_material_category --> IdentifiedConcept (Earthenware) + | + |-- keywords --> IdentifiedConcept (Neolithic, Pottery) +``` + +--- + +## The 8 Entity Types (oTypes) + +These are the **node types** in the iSamples graph. Each node has an `otype` field that identifies its type. + +### 1. MaterialSampleRecord + +**What it represents:** The physical sample itself - the core entity in iSamples.
+
+**Domain examples:**
+- Archaeology: Pottery sherd, bone fragment, textile
+- Geology: Rock core, mineral specimen, sediment sample
+- Biology: Tissue sample, DNA extract, whole organism
+
+**Key properties:**
+- `pid` - Unique identifier (typically IGSN)
+- `label` - Human-readable name
+- `description` - Detailed description
+- `sample_identifier` - Canonical sample ID
+
+**Required relationships:**
+- Must have `produced_by` → SamplingEvent
+- Must have `has_material_category` → IdentifiedConcept
+- Must have `has_context_category` → IdentifiedConcept
+- Must have `has_sample_object_type` → IdentifiedConcept
+
+**Example:**
+```yaml
+otype: MaterialSampleRecord
+pid: "igsn:SSH000001"
+label: "Ceramic bowl fragment from Trench 5"
+description: "Red-slipped pottery sherd with geometric decoration"
+```
+
+---
+
+### 2. SamplingEvent
+
+**What it represents:** The activity that collected/created the sample.
+
+**Domain examples:**
+- Archaeology: Excavation, surface collection, test pit
+- Geology: Core drilling, outcrop sampling, dredge
+- Biology: Field collection, trap deployment, specimen preparation
+
+**Key properties:**
+- `pid` - Event identifier
+- `label` - Event name/code
+- `description` - Sampling procedure details
+- `result_time` - When sample was collected (date/datetime)
+- `has_feature_of_interest` - What was sampled
+- `project` - Project identifier or name
+
+**Relationships:**
+- Links TO: SamplingSite, GeospatialCoordLocation, Agent
+- Links FROM: MaterialSampleRecord (via `produced_by`)
+
+**Example:**
+```yaml
+otype: SamplingEvent
+pid: "event:2023-catal-t5-001"
+label: "Trench 5, Level 3, 2023-07-15"
+result_time: "2023-07-15"
+has_feature_of_interest: "Neolithic architectural feature"
+```
+
+---
+
+### 3. SamplingSite
+
+**What it represents:** Named location where sampling occurred.
+
+**Domain examples:**
+- Archaeology: Site name (Çatalhöyük, Pompeii)
+- Geology: Formation/locality name (Yellowstone Core Site YC-01)
+- Biology: Research station, reef system, forest plot
+
+**Key properties:**
+- `pid` - Site identifier
+- `label` - Site name
+- `description` - Site description
+- `place_name` - One or more names for the site
+
+**Relationships:**
+- Links TO: GeospatialCoordLocation (via `site_location`)
+- Links TO: SamplingSite (via `is_part_of` for nested sites)
+- Links FROM: SamplingEvent (via `sampling_site`)
+
+**Example:**
+```yaml
+otype: SamplingSite
+pid: "site:catalhoyuk-south"
+label: "Çatalhöyük South Area"
+place_name: ["Çatalhöyük", "Çatal Höyük", "Chatal Huyuk"]
+```
+
+---
+
+### 4. GeospatialCoordLocation
+
+**What it represents:** Precise geographic coordinates (WGS84).
+
+**Key properties:**
+- `pid` - Coordinate identifier
+- `latitude` - Decimal degrees (-90 to 90)
+- `longitude` - Decimal degrees (-180 to 180)
+- `elevation` - String with value, units, datum (e.g., "401 m above mean sea level")
+- `obfuscated` - Boolean flag if coordinates are intentionally imprecise
+
+**Relationships:**
+- Links FROM: SamplingEvent (via `sample_location`)
+- Links FROM: SamplingSite (via `site_location`)
+
+**Example:**
+```yaml
+otype: GeospatialCoordLocation
+pid: "coord:37.6665-32.8274"
+latitude: 37.6665
+longitude: 32.8274
+elevation: "1015 m above mean sea level"
+obfuscated: false
+```
+
+---
+
+### 5. IdentifiedConcept
+
+**What it represents:** Controlled vocabulary terms for classification and keywords.
+
+**Used for:**
+- Material types (rock, ceramic, DNA, etc.)
+- Sampled feature types (terrestrial, marine, archaeological)
+- Sample object types (core, hand specimen, thin section)
+- Free-text keywords for discovery
+
+**Key properties:**
+- `pid` - Concept URI (from controlled vocabulary)
+- `label` - Human-readable term
+- `scheme_name` - Vocabulary name
+- `scheme_uri` - Vocabulary identifier
+
+**Vocabularies:**
+- [Material Type Vocabulary](https://w3id.org/isample/vocabulary/material/)
+- [Sampled Feature Vocabulary](https://w3id.org/isample/vocabulary/sampledfeature/)
+- [Material Sample Object Type Vocabulary](https://w3id.org/isample/vocabulary/materialsampleobjecttype/)
+
+**Example:**
+```yaml
+otype: IdentifiedConcept
+pid: "https://w3id.org/isample/vocabulary/material/0.9/earthenware"
+label: "Earthenware"
+scheme_name: "iSamples Material Type Vocabulary"
+scheme_uri: "https://w3id.org/isample/vocabulary/material/"
+```
+
+---
+
+### 6. Agent
+
+**What it represents:** Person or organization with a role in sample lifecycle.
+
+**Roles:**
+- Collector (sampling event)
+- Registrant (sample registration)
+- Curator (sample storage)
+
+**Key properties:**
+- `pid` - Agent identifier (ORCID preferred)
+- `name` - Person/organization name
+- `affiliation` - Institutional affiliation
+- `contact_information` - Email, phone, address
+- `role` - Role relative to sample
+
+**Example:**
+```yaml
+otype: Agent
+pid: "https://orcid.org/0000-0002-1234-5678"
+name: "Dr. Jane Smith"
+affiliation: "University of Example"
+contact_information: "jsmith@example.edu"
+role: "Principal Investigator"
+```
+
+---
+
+### 7. MaterialSampleCuration
+
+**What it represents:** Information about sample storage, access, and curation history.
+
+**Key properties:**
+- `pid` - Curation record identifier
+- `label` - Collection/storage name
+- `description` - Curation procedures
+- `curation_location` - Where sample is stored
+- `access_constraints` - Access restrictions
+
+**Relationships:**
+- Links TO: Agent (via `responsibility`)
+- Links FROM: MaterialSampleRecord (via `curation`)
+
+**Example:**
+```yaml
+otype: MaterialSampleCuration
+pid: "curation:smithsonian-nmnh-123"
+label: "Smithsonian NMNH Anthropology Collection"
+curation_location: "National Museum of Natural History, Washington DC"
+access_constraints: "Appointment required, no destructive sampling"
+```
+
+---
+
+### 8. SampleRelation
+
+**What it represents:** Relationship between samples (parent-child, sibling, etc.).
+
+**Use cases:**
+- Parent sample → subsample
+- Whole organism → tissue → DNA extract
+- Core sample → thin section → analysis aliquot
+
+**Key properties:**
+- `pid` - Relation identifier
+- `label` - Relation description
+- `description` - Details of relationship
+- `relationship` - Relation type (e.g., "derivedFrom")
+- `target` - PID of related sample
+
+**Example:**
+```yaml
+otype: SampleRelation
+pid: "relation:subsample-001"
+label: "Subsample for radiocarbon dating"
+relationship: "derivedFrom"
+target: "igsn:SSH000001"
+```
+
+---
+
+## The 14 Relationship Types (Predicates)
+
+These are the **edge types** (predicates) that connect nodes in the iSamples graph. Each edge has:
+- `s` (subject) - Source node row_id
+- `p` (predicate) - Relationship type
+- `o` (object) - Target node row_id(s)
+
+### From MaterialSampleRecord (8 predicates)
+
+| Predicate | Target Type | Cardinality | Description |
+|-----------|-------------|-------------|-------------|
+| `produced_by` | SamplingEvent | One | Links sample to collection event |
+| `has_material_category` | IdentifiedConcept | Many | What is it made of? |
+| `has_context_category` | IdentifiedConcept | Many | What domain/environment? |
+| `has_sample_object_type` | IdentifiedConcept | Many | What form does it take? |
+| `keywords` | IdentifiedConcept | Many | Discovery keywords |
+| `registrant` | Agent | One | Who registered this sample? |
+| `curation` | MaterialSampleCuration | One | Where is it stored? |
+| `related_resource` | SampleRelation | Many | Links to related samples |
+
+### From SamplingEvent (4 predicates)
+
+| Predicate | Target Type | Cardinality | Description |
+|-----------|-------------|-------------|-------------|
+| `sampling_site` | SamplingSite | One | Where was it collected? |
+| `sample_location` | GeospatialCoordLocation | One | Precise coordinates |
+| `responsibility` | Agent | Many | Who collected it? |
+| `has_context_category` | IdentifiedConcept | Many | Sampling context |
+
+### From SamplingSite (1 predicate)
+
+| Predicate | Target Type | Cardinality | Description |
+|-----------|-------------|-------------|-------------|
+| `site_location` | GeospatialCoordLocation | One | Site coordinates |
+
+### From MaterialSampleCuration (1 predicate)
+
+| Predicate | Target Type | Cardinality | Description |
+|-----------|-------------|-------------|-------------|
+| `responsibility` | Agent | Many | Who curates it? |
+
+---
+
+## The 14 Sentence Types
+
+Think of these as the **grammar** of iSamples metadata. Each represents a valid statement you can make about samples:
+
+### Core Sample Provenance (3 sentence types)
+
+1. **"This sample was produced by this sampling event"**
+   - `MaterialSampleRecord --produced_by--> SamplingEvent`
+   - Every sample MUST have this relationship
+   - Links sample to its collection context
+
+2. **"This sample is made of this material type"**
+   - `MaterialSampleRecord --has_material_category--> IdentifiedConcept`
+   - Required: At least one material classification
+   - Example: Earthenware, Basalt, DNA
+
+3. **"This sample represents this context"**
+   - `MaterialSampleRecord --has_context_category--> IdentifiedConcept`
+   - Required: Domain classification
+   - Example: Terrestrial/Archaeological, Marine Biome
+
+### Sample Classification (2 sentence types)
+
+4. **"This sample takes this physical form"**
+   - `MaterialSampleRecord --has_sample_object_type--> IdentifiedConcept`
+   - Required: Object type
+   - Example: Sherd, Core, Specimen
+
+5. **"This sample is described by these keywords"**
+   - `MaterialSampleRecord --keywords--> IdentifiedConcept`
+   - Optional: Discovery keywords
+   - Example: Neolithic, Pottery, Red-slipped
+
+### Sample Stewardship (2 sentence types)
+
+6. **"This sample was registered by this person"**
+   - `MaterialSampleRecord --registrant--> Agent`
+   - Optional: Who created the metadata record
+   - Example: Data curator, Collection manager
+
+7. **"This sample is stored here"**
+   - `MaterialSampleRecord --curation--> MaterialSampleCuration`
+   - Optional: Storage and access information
+
+### Sample Relationships (1 sentence type)
+
+8. **"This sample relates to that sample"**
+   - `MaterialSampleRecord --related_resource--> SampleRelation`
+   - Optional: Parent-child, sibling relationships
+   - Example: Subsample, Derived from
+
+### Event Location (2 sentence types)
+
+9. **"This event occurred at this site"**
+   - `SamplingEvent --sampling_site--> SamplingSite`
+   - Optional: Named sampling location
+   - Example: Çatalhöyük, Yellowstone Core Site
+
+10. **"This event occurred at these coordinates"**
+    - `SamplingEvent --sample_location--> GeospatialCoordLocation`
+    - Optional but common: Precise sample coordinates
+    - Example: 37.6665°N, 32.8274°E
+
+### Event Responsibility (2 sentence types)
+
+11. **"This person collected at this event"**
+    - `SamplingEvent --responsibility--> Agent`
+    - Optional: Field collectors, project team
+
+12. **"This event belongs to this context"**
+    - `SamplingEvent --has_context_category--> IdentifiedConcept`
+    - Optional: Event-level context classification
+
+### Site Location (1 sentence type)
+
+13. **"This site is located at these coordinates"**
+    - `SamplingSite --site_location--> GeospatialCoordLocation`
+    - Optional: Site-level coordinates (less precise than sample)
+
+### Curation Responsibility (1 sentence type)
+
+14. **"This person curates this collection"**
+    - `MaterialSampleCuration --responsibility--> Agent`
+    - Optional: Curators, collection managers
+
+---
+
+## Why This Structure?
+
+### Multi-Hop Traversal by Design
+
+**Finding a sample's coordinates requires multiple hops:**
+
+```
+MaterialSampleRecord
+  → produced_by → SamplingEvent
+    → sample_location → GeospatialCoordLocation
+```
+
+**Why not store coordinates directly on the sample?**
+
+✅ **Benefits of separation:**
+1. **Shared locations** - Multiple samples from same event share one coordinate
+2. **Different precision** - Site coordinates vs exact sample coordinates
+3. **Reusable events** - One event can produce many samples
+4. **Flexible modeling** - Some samples have site but not precise coords
+
+❌ **Drawbacks of flat structure:**
+- Duplicate coordinates across samples
+- Can't distinguish site-level vs sample-level precision
+- Harder to maintain data consistency
+
+### Domain-Agnostic Design
+
+The 8 entity types work across **all scientific domains**:
+
+- **Archaeology:** Pottery, bones, charcoal → Terrestrial/Archaeological context
+- **Geology:** Cores, outcrops, minerals → Terrestrial/Subsurface context
+- **Biology:** Tissue, DNA, specimens → Marine Biome or Terrestrial context
+
+**Same schema, different values** - This is true domain-agnostic modeling.
+
+### Graph Query Flexibility
+
+**Example queries enabled by graph structure:**
+
+1. "Find all samples collected by Agent X"
+   - `MaterialSampleRecord → produced_by → SamplingEvent → responsibility → Agent`
+
+2. "Find all samples within 10km of a location"
+   - `MaterialSampleRecord → produced_by → SamplingEvent → sample_location → GeospatialCoordLocation`
+
+3. "Find all earthenware samples with keywords 'Neolithic' AND 'pottery'"
+   - `MaterialSampleRecord → has_material_category → IdentifiedConcept` (Earthenware)
+   - `MaterialSampleRecord → keywords → IdentifiedConcept` (Neolithic, Pottery)
+
+4. "Find parent sample for a given subsample"
+   - `MaterialSampleRecord → related_resource → SampleRelation` (where relationship="derivedFrom")
+
+---
+
+## Graph Traversal Patterns
+
+### Pattern 1: Sample → Coordinates (2-3 hops)
+
+**Path:**
+```
+Sample → produced_by → Event → sample_location → Coords
+```
+
+**SQL example:**
+```sql
+SELECT
+    sample.pid AS sample_id,
+    coords.latitude,
+    coords.longitude
+FROM pqg AS sample
+JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by'
+JOIN pqg AS event ON event.row_id = ANY(edge1.o)
+JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sample_location'
+JOIN pqg AS coords ON coords.row_id = ANY(edge2.o)
+WHERE sample.otype = 'MaterialSampleRecord'
+```
+
+### Pattern 2: Sample → Site Name (2 hops)
+
+**Path:**
+```
+Sample → produced_by → Event → sampling_site → Site
+```
+
+**SQL example:**
+```sql
+SELECT
+    sample.label AS sample_label,
+    site.label AS site_name,
+    site.place_name
+FROM pqg AS sample
+JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by'
+JOIN pqg AS event ON event.row_id = ANY(edge1.o)
+JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'sampling_site'
+JOIN pqg AS site ON site.row_id = ANY(edge2.o)
+WHERE sample.otype = 'MaterialSampleRecord'
+```
+
+### Pattern 3: Sample → Collector (2 hops)
+
+**Path:**
+```
+Sample → produced_by → Event → responsibility → Agent
+```
+
+**SQL example:**
+```sql
+SELECT
+    sample.label AS sample_label,
+    agent.name AS collector_name,
+    event.result_time
+FROM pqg AS sample
+JOIN pqg AS edge1 ON edge1.s = sample.row_id AND edge1.p = 'produced_by'
+JOIN pqg AS event ON event.row_id = ANY(edge1.o)
+JOIN pqg AS edge2 ON edge2.s = event.row_id AND edge2.p = 'responsibility'
+JOIN pqg AS agent ON agent.row_id = ANY(edge2.o)
+WHERE sample.otype = 'MaterialSampleRecord'
+```
+
+### Pattern 4: Material Type Filter (1 hop)
+
+**Path:**
+```
+Sample → has_material_category → Concept
+```
+
+**SQL example:**
+```sql
+SELECT
+    sample.pid,
+    sample.label,
+    concept.label AS material_type
+FROM pqg AS sample
+JOIN pqg AS edge ON edge.s = sample.row_id AND edge.p = 'has_material_category'
+JOIN pqg AS concept ON concept.row_id = ANY(edge.o)
+WHERE concept.label = 'Earthenware'
+```
+
+---
+
+## Storage Format
+
+### Unified Table Structure
+
+PQG stores **both nodes and edges in a single table**:
+
+```sql
+CREATE TABLE pqg (
+    row_id INTEGER PRIMARY KEY,
+    pid VARCHAR UNIQUE NOT NULL,
+    otype VARCHAR,              -- Node type or '_edge_'
+
+    -- Edge fields (NULL for non-edge nodes)
+    s INTEGER,                  -- Subject row_id
+    p VARCHAR,                  -- Predicate
+    o INTEGER[],                -- Object row_id(s)
+    n VARCHAR,                  -- Named graph
+
+    -- Entity properties (NULL for edges)
+    label VARCHAR,
+    description TEXT,
+    latitude DECIMAL,
+    longitude DECIMAL,
+    elevation VARCHAR,
+    ...
+);
+```
+
+### Node Rows
+
+**Example: MaterialSampleRecord node**
+```
+row_id: 1
+pid: "igsn:SSH000001"
+otype: "MaterialSampleRecord"
+s: NULL
+p: NULL
+o: NULL
+n: NULL
+label: "Ceramic bowl fragment"
+description: "Red-slipped pottery..."
+```
+
+### Edge Rows
+
+**Example: produced_by edge**
+```
+row_id: 1001
+pid: "edge_12345"
+otype: "_edge_"
+s: 1 (sample row_id)
+p: "produced_by"
+o: [2] (event row_id)
+n: NULL
+label: NULL
+description: NULL
+```
+
+### Query Pattern
+
+**Find all edges of a specific type:**
+```sql
+SELECT
+    subject.pid AS subject_pid,
+    edge.p AS predicate,
+    object.pid AS object_pid
+FROM pqg AS edge
+JOIN pqg AS subject ON edge.s = subject.row_id
+JOIN pqg AS object ON object.row_id = ANY(edge.o)
+WHERE edge.otype = '_edge_'
+  AND subject.otype = 'MaterialSampleRecord'
+  AND edge.p = 'produced_by'
+  AND object.otype = 'SamplingEvent'
+```
+
+---
+
+## Summary
+
+**The iSamples property graph is defined by:**
+
+✅ **8 entity types (nodes)** - MaterialSampleRecord, SamplingEvent, SamplingSite, GeospatialCoordLocation, IdentifiedConcept, Agent, MaterialSampleCuration, SampleRelation
+
+✅ **14 relationship types (edges)** - The complete grammar of valid connections
+
+✅ **14 sentence types** - All possible statements you can make about samples
+
+**Key takeaway:** Understanding these building blocks enables you to:
+- Query iSamples data effectively
+- Validate metadata completeness
+- Integrate new data sources
+- Build tools that work across domains
+
+**Next steps:**
+- [PREDICATES_REFERENCE.md](./PREDICATES_REFERENCE.md) - Detailed reference for each predicate
+- [EXAMPLES_BY_DOMAIN.md](./EXAMPLES_BY_DOMAIN.md) - Real-world examples
+- [QUERYING_THE_GRAPH.md](./QUERYING_THE_GRAPH.md) - Query patterns and SQL
+
+---
+
+**Document Version:** 1.0
+**Last Updated:** 2025-11-14
+**Schema Version:** 20250207 (MaterialSampleRecord)
+**Author:** Claude Code (Sonnet 4.5) based on iSamples LinkML schema analysis
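
**Illustrative sketch (not part of the schema):** The unified node-and-edge table and the Sample → Event → Coordinates traversal described in this guide can be mimicked in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the PQG implementation: the names `ROWS`, `BY_ID`, `follow`, and `sample_coordinates` are hypothetical, the row ids for the event and coordinate nodes are chosen for illustration, and `edge_12346` is an invented edge pid. The other pids and values reuse the examples shown earlier in this guide.

```python
from typing import Any, Dict, List, Tuple

# Toy stand-in for the single pqg table: node rows and edge rows share one list.
# Edge rows use otype "_edge_" and carry s (subject), p (predicate), o (object ids).
ROWS: List[Dict[str, Any]] = [
    {"row_id": 1, "pid": "igsn:SSH000001", "otype": "MaterialSampleRecord"},
    {"row_id": 2, "pid": "event:2023-catal-t5-001", "otype": "SamplingEvent"},
    {"row_id": 3, "pid": "coord:37.6665-32.8274", "otype": "GeospatialCoordLocation",
     "latitude": 37.6665, "longitude": 32.8274},
    {"row_id": 1001, "pid": "edge_12345", "otype": "_edge_",
     "s": 1, "p": "produced_by", "o": [2]},
    # Hypothetical second edge, added only for this illustration.
    {"row_id": 1002, "pid": "edge_12346", "otype": "_edge_",
     "s": 2, "p": "sample_location", "o": [3]},
]

# Index rows by row_id so edge targets can be resolved directly.
BY_ID: Dict[int, Dict[str, Any]] = {row["row_id"]: row for row in ROWS}


def follow(subject_id: int, predicate: str) -> List[Dict[str, Any]]:
    """Return the nodes reached from subject_id over one predicate (one hop)."""
    targets: List[Dict[str, Any]] = []
    for row in ROWS:
        if row["otype"] == "_edge_" and row["s"] == subject_id and row["p"] == predicate:
            # o is a list because predicates may be multi-valued.
            targets.extend(BY_ID[object_id] for object_id in row["o"])
    return targets


def sample_coordinates(sample_pid: str) -> List[Tuple[float, float]]:
    """Two hops: sample -> produced_by -> event -> sample_location -> coords."""
    coords: List[Tuple[float, float]] = []
    for row in ROWS:
        if row["otype"] == "MaterialSampleRecord" and row["pid"] == sample_pid:
            for event in follow(row["row_id"], "produced_by"):
                for location in follow(event["row_id"], "sample_location"):
                    coords.append((location["latitude"], location["longitude"]))
    return coords


if __name__ == "__main__":
    print(sample_coordinates("igsn:SSH000001"))
```

Each call to `follow` plays the role of one edge JOIN plus one node JOIN in the SQL patterns, and iterating over `o` corresponds to `ANY(edge.o)`. A real query would run against the stored table rather than an in-memory list.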