From c517bd2b3395fd676bdb136826f5a2e6a8955958 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Thu, 6 Mar 2025 17:12:42 +0100 Subject: [PATCH 01/20] add title and edit header --- SWIPs/swip-data-provenance.md | 54 +++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) create mode 100644 SWIPs/swip-data-provenance.md diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md new file mode 100644 index 0000000..be5efe4 --- /dev/null +++ b/SWIPs/swip-data-provenance.md @@ -0,0 +1,54 @@ +--- +SWIP: +title: Data provenance on Swarm +author: Črt Ahlin (@crtahlin) +discussions-to: https://discord.com/channels/799027393297514537/1239813439136993280 +status: WIP +type: Informational + +created: 2025-03-06 +requires (*optional): +replaces (*optional): +--- + + +This is the suggested template for new SWIPs. + +Note that a SWIP number will be assigned by an editor. When opening a pull request to submit your SWIP, please use an abbreviated title in the filename, `SWIP-draft_title_abbrev.md`. + +The title should be 44 characters or less. + +## Simple Summary + +If you can't explain it simply, you don't understand it well enough." Provide a simplified and layman-accessible explanation of the SWIP. + +## Abstract + +A short (~200 word) description of the technical issue being addressed. + +## Motivation + +The motivation is critical for SWIPs that want to change the Swarm protocol. It should clearly explain why the existing protocol specification is inadequate to address the problem that the SWIP solves. SWIP submissions without sufficient motivation may be rejected outright. + +## Specification + +The technical specification should describe the syntax and semantics of any new feature. The specification should be detailed enough to allow competing, interoperable implementations for the current Swarm platform and future client implementations. 
+ +## Rationale + +The rationale fleshes out the specification by describing what motivated the design and why particular design decisions were made. It should describe alternate designs that were considered and related work, e.g. how the feature is supported in other languages. The rationale may also provide evidence of consensus within the community, and should discuss important objections or concerns raised during discussion. + +## Backwards Compatibility + +All SWIPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The SWIP must explain how the author proposes to deal with these incompatibilities. SWIP submissions without a sufficient backwards compatibility treatise may be rejected outright. + +## Test Cases + +Test cases for an implementation are mandatory for SWIPs that are affecting changes to data and message formats. Other SWIPs can choose to include links to test cases if applicable. + +## Implementation + +The implementations must be completed before any SWIP is given status "Final", but it need not be completed before the SWIP is accepted. While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of API details. + +## Copyright +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). 
From 90a41fd32cfa69ea793280b277af75d1a6b4bd88 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Thu, 6 Mar 2025 17:26:13 +0100 Subject: [PATCH 02/20] add draft sections sans specification section --- SWIPs/swip-data-provenance.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index be5efe4..e6c2ed8 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -20,15 +20,15 @@ The title should be 44 characters or less. ## Simple Summary -If you can't explain it simply, you don't understand it well enough." Provide a simplified and layman-accessible explanation of the SWIP. +This SWIP outlines how Swarm decentralized storage can be utilized as a trusted third-party solution for storing and managing data provenance. It highlights potential use cases, technical considerations, and business benefits for leveraging Swarm in provenance-related applications. ## Abstract -A short (~200 word) description of the technical issue being addressed. +Provenance ensures accountability and integrity by tracking the origins and transformations of data. This SWIP explores how Swarm’s decentralized storage can serve as a foundation for provenance systems by leveraging its immutability, trustless design, and scalability. The document discusses compatibility with existing standards (e.g., W3C PROV, Data & Trust Alliance spec), technical requirements for uploading and accessing provenance data, and considerations for privacy and encryption. It also addresses potential extensions, such as integrating AI agents for data validation and interpretation. ## Motivation -The motivation is critical for SWIPs that want to change the Swarm protocol. It should clearly explain why the existing protocol specification is inadequate to address the problem that the SWIP solves. SWIP submissions without sufficient motivation may be rejected outright. 
+Data provenance is critical for ethical AI development, regulatory compliance, and ensuring trust in data-driven systems. Current solutions often rely on centralized storage or public blockchains, which have limitations in scalability, privacy, or cost. Swarm offers a unique alternative as a decentralized storage network that combines immutability with flexibility. This SWIP aims to demonstrate how Swarm can address key challenges in provenance systems while aligning with emerging standards and market needs. ## Specification @@ -36,11 +36,12 @@ The technical specification should describe the syntax and semantics of any new ## Rationale -The rationale fleshes out the specification by describing what motivated the design and why particular design decisions were made. It should describe alternate designs that were considered and related work, e.g. how the feature is supported in other languages. The rationale may also provide evidence of consensus within the community, and should discuss important objections or concerns raised during discussion. +Swarm’s decentralized nature makes it ideal for acting as a trusted third party in provenance systems. Unlike public blockchains, it supports larger data sizes without compromising privacy when encryption is used. While standards like W3C PROV are comprehensive, they may be too complex for some use cases; simpler alternatives like the Data & Trust Alliance spec are more practical for initial implementations. This approach allows flexibility while ensuring compatibility with existing standards. ## Backwards Compatibility -All SWIPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The SWIP must explain how the author proposes to deal with these incompatibilities. SWIP submissions without a sufficient backwards compatibility treatise may be rejected outright. +This proposal does not introduce changes to Swarm’s core functionality or protocols. 
It leverages existing capabilities such as immutable storage and reference hashes, ensuring full compatibility with current implementations. + ## Test Cases @@ -48,7 +49,7 @@ Test cases for an implementation are mandatory for SWIPs that are affecting chan ## Implementation -The implementations must be completed before any SWIP is given status "Final", but it need not be completed before the SWIP is accepted. While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of API details. +A prototype toolkit will be developed as part of the fellowship deliverables. This toolkit will include features for uploading, retrieving, validating, and extending storage of provenance data on Swarm. ## Copyright Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). From 8d6882b471d26488851df61dd538ccdb339cce9f Mon Sep 17 00:00:00 2001 From: crtahlin Date: Fri, 7 Mar 2025 19:40:12 +0100 Subject: [PATCH 03/20] add the specification section --- SWIPs/swip-data-provenance.md | 42 ++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index e6c2ed8..d970846 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -32,15 +32,55 @@ Data provenance is critical for ethical AI development, regulatory compliance, a ## Specification -The technical specification should describe the syntax and semantics of any new feature. The specification should be detailed enough to allow competing, interoperable implementations for the current Swarm platform and future client implementations. + + +### 1. 
Provenance Record Structure +The provenance file will be stored in JSON format with the following structure: + +```json +{ + "provenance_metadata_id": "UUID string", + "content_hash": "sha256:9f86d...a9e", + "provenance_standard": "DaTA v1.0.0", + "data_swarm_reference": "", + "stamp_id": "0xfe2f...c3a1" +} +``` + +*This structure allows a unique identification of the metadata itself (UUID), a reference to the actual provenance data via a Swarm hash, a declaration of the provenance standard used, and the stamp associated with the storage.* + +### 2. Toolkit Features +The toolkit provides functionalities to interact with the Swarm network: + +- **Provenance Metadata Upload**: Uploads the JSON metadata file to Swarm. +- **Provenance Data Upload**: Uploads the actual provenance data file to Swarm. +- **Provenance Metadata Access**: Retrieves the JSON metadata file using its Swarm reference hash. +- **Provenance Data Access**: Retrieves the actual provenance data file using its Swarm reference hash. +- **Check TTL**: Queries the stamp associated with the storage to determine the remaining storage duration for both the metadata and the data. +- **Stamp Top-Up**: Extends the storage period for the associated stamp (both the metadata and the data are extended). +- **Data Existence Check**: Verifies whether the data (provenance metadata and/or actual provenance data) exists on the Swarm network. + +These features can utilize a Swarm gateway or a local Bee node, as per the user’s choice. + +### 3. Privacy Controls + +*Privacy controls are optional at later stages and are left to the discretion of the user. The user can choose to encrypt the data before uploading it.* + ## Rationale Swarm’s decentralized nature makes it ideal for acting as a trusted third party in provenance systems. Unlike public blockchains, it supports larger data sizes without compromising privacy when encryption is used. 
While standards like W3C PROV are comprehensive, they may be too complex for some use cases; simpler alternatives like the Data & Trust Alliance spec are more practical for initial implementations. This approach allows flexibility while ensuring compatibility with existing standards. +- Adopting Existing Standards: By relying on the simpler DaTA spec for most use cases (with the option for W3C PROV), the solution avoids over-complication while maintaining interoperability. +- Swarm’s Suitability: Its decentralized, immutable nature makes it an ideal candidate for storing provenance data, which benefits from being tamper-resistant and verifiable. +- AI Integration: Embedding AI within the upload flow can not only prevent privacy breaches but also assist in interpreting provenance data—a value-add that enhances user trust and usability. + +A comparative advantage over alternatives (e.g., centralized storage solutions or blockchain-based systems) is seen in Swarm’s cost-effectiveness and scalability, without compromising security or data integrity. + ## Backwards Compatibility This proposal does not introduce changes to Swarm’s core functionality or protocols. It leverages existing capabilities such as immutable storage and reference hashes, ensuring full compatibility with current implementations. +It operates on top of the existing Swarm infrastructure and adheres to established file storage and retrieval methods. All existing Swarm tools (like the Bee CLI and Dashboard) remain fully compatible with this additional use case. 
## Test Cases From 40b1917a53a126145155c6208fae58fd2e59aa1c Mon Sep 17 00:00:00 2001 From: crtahlin Date: Fri, 7 Mar 2025 19:47:40 +0100 Subject: [PATCH 04/20] add minor explanation --- SWIPs/swip-data-provenance.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index d970846..a0b3d03 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -20,7 +20,7 @@ The title should be 44 characters or less. ## Simple Summary -This SWIP outlines how Swarm decentralized storage can be utilized as a trusted third-party solution for storing and managing data provenance. It highlights potential use cases, technical considerations, and business benefits for leveraging Swarm in provenance-related applications. +This SWIP outlines how Swarm decentralized storage can be utilized as a trusted third-party solution for storing and managing data provenance - tracking the origins and transformations of data. It highlights potential use cases, technical considerations, and business benefits for leveraging Swarm in provenance-related applications. The intended audience are developers, that would be using Swarm as provenance recording solution. 
## Abstract From bb1766211bb018a67c00370b4c34db9e8e4810c2 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 10:13:56 +0100 Subject: [PATCH 05/20] change title --- SWIPs/swip-data-provenance.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index a0b3d03..afc8a46 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -1,6 +1,6 @@ --- SWIP: -title: Data provenance on Swarm +title: Swarm-Based Provenance Framework for Data Accountability and Trust author: Črt Ahlin (@crtahlin) discussions-to: https://discord.com/channels/799027393297514537/1239813439136993280 status: WIP From 3faf8caaed9d5ce5cc9ff7a717a28f3d1f5df0ef Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 10:38:14 +0100 Subject: [PATCH 06/20] revise summary and abstract --- SWIPs/swip-data-provenance.md | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index afc8a46..46a8943 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -20,11 +20,30 @@ The title should be 44 characters or less. ## Simple Summary -This SWIP outlines how Swarm decentralized storage can be utilized as a trusted third-party solution for storing and managing data provenance - tracking the origins and transformations of data. It highlights potential use cases, technical considerations, and business benefits for leveraging Swarm in provenance-related applications. The intended audience are developers, that would be using Swarm as provenance recording solution. + +This SWIP proposes Swarm as a decentralized storage layer for provenance metadata and data. Provenance, the documented history of a dataset's origin and transformations, is increasingly important for regulatory compliance, ethical AI, and data accountability. 
A toolkit will provide utilities to: +- Upload/Download: Store and retrieve provenance files (in any format) with Swarm reference hashes. +- Metadata Management: Track storage validity (TTL) and extend it via stamp top-ups. +- Data Integrity: Verify content through SHA-256 hashes. + +The framework does not enforce specific provenance standards but ensures compatibility by decoupling metadata (structured JSON) from the actual provenance data (stored as arbitrary files). Developers and enterprises retain full control over their data format and privacy measures. ## Abstract -Provenance ensures accountability and integrity by tracking the origins and transformations of data. This SWIP explores how Swarm’s decentralized storage can serve as a foundation for provenance systems by leveraging its immutability, trustless design, and scalability. The document discusses compatibility with existing standards (e.g., W3C PROV, Data & Trust Alliance spec), technical requirements for uploading and accessing provenance data, and considerations for privacy and encryption. It also addresses potential extensions, such as integrating AI agents for data validation and interpretation. + +Provenance systems require immutable, scalable storage to track data lineage effectively. This SWIP leverages Swarm’s decentralized network to: +- Store Provenance Data: Users upload files in any format (e.g., W3C PROV-JSONLD, DaTA spec, or custom schemas). +- Manage Metadata: A JSON wrapper includes: + - `provenance_metadata_id` (UUID for unique identification) + - `data_swarm_reference` (Swarm hash pointing to the provenance file) + - `stamp_id` (for TTL tracking and renewal) + - `content_hash` (SHA-256 for integrity checks) + - `provenance_standard` (optional field for self-declared standards) +- Ensure Flexibility: No Swarm-level validation of provenance formats—compatibility is achieved by design. 
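Reviewer note: the metadata wrapper listed above can be sketched in a few lines of Python. This is only an illustration of the field list, not the toolkit's actual code. The function name `build_metadata` is hypothetical, and the Swarm reference and stamp ID are assumed to have been obtained beforehand (for example by uploading the provenance file through a Bee node or gateway).

```python
import hashlib
import uuid
from typing import Optional


def build_metadata(
    data: bytes,
    swarm_reference: str,
    stamp_id: str,
    provenance_standard: Optional[str] = None,
) -> dict:
    """Build the JSON-serializable wrapper described in the Abstract.

    `swarm_reference` and `stamp_id` are taken as given; this sketch does
    not talk to Swarm at all (hypothetical helper, not the toolkit's API).
    """
    metadata = {
        # UUID for unique identification of the metadata record itself
        "provenance_metadata_id": str(uuid.uuid4()),
        # Swarm hash pointing to the separately stored provenance file
        "data_swarm_reference": swarm_reference,
        # Stamp used for TTL tracking and renewal
        "stamp_id": stamp_id,
        # SHA-256 over the provenance file bytes, for integrity checks
        "content_hash": "sha256:" + hashlib.sha256(data).hexdigest(),
    }
    # The standard is self-declared and optional, so it is only set on request
    if provenance_standard is not None:
        metadata["provenance_standard"] = provenance_standard
    return metadata
```

Keeping the wrapper a plain dictionary means it can be passed straight to `json.dumps` before upload, and no provenance format is validated at this layer, matching the "no Swarm-level validation" point above.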
+ +A prototype toolkit (developed under the DataFund Fellowship) will provide CLI and API access to Swarm, enabling integration into existing workflows. Privacy and encryption remain optional, allowing users to comply with regulations like GDPR independently. + + ## Motivation From 4a9a7cd67431881b206423529fd0d97232ed6e99 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 10:46:25 +0100 Subject: [PATCH 07/20] rewrite motivation --- SWIPs/swip-data-provenance.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 46a8943..ee29cb5 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -47,7 +47,16 @@ A prototype toolkit (developed under the DataFund Fellowship) will provide CLI a ## Motivation -Data provenance is critical for ethical AI development, regulatory compliance, and ensuring trust in data-driven systems. Current solutions often rely on centralized storage or public blockchains, which have limitations in scalability, privacy, or cost. Swarm offers a unique alternative as a decentralized storage network that combines immutability with flexibility. This SWIP aims to demonstrate how Swarm can address key challenges in provenance systems while aligning with emerging standards and market needs. +Data provenance is critical for ethical AI development, regulatory compliance, and ensuring trust in data-driven systems. Existing provenance solutions often face challenges in terms of vendor lock-in, scalability, and privacy. Centralized systems create dependencies and potential single points of failure. Public blockchains, while immutable, can be costly for large-scale data storage and may not adequately address privacy concerns. 
+ +Swarm offers a compelling alternative by acting as a trustless, decentralized storage network: + +- **Trusted 3rd Party**: Swarm's decentralized architecture serves as a neutral platform for recording provenance, eliminating single points of control. +- **Cost Considerations**: While centralized cloud storage may offer lower costs for simple storage, Swarm provides a more cost-competitive option compared to blockchain-based solutions. +- **Interoperability**: The toolkit is designed to accommodate various provenance standards (e.g., DaTA, W3C PROV) without enforcing a specific format, allowing users to adopt the standard that best suits their needs. + +This proposal aims to align with the Data Spaces Support Centre blueprint for a Technical Building Block covering Provenance & Traceability. By enabling easy uploading, downloading, and management of provenance files, the toolkit empowers users to meet emerging regulatory requirements, such as those outlined in the EU AI Act, and to establish trust and accountability in data-driven systems. + ## Specification From b1facc91f53176fb6490dca346183e14a9f5ce0b Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 11:17:36 +0100 Subject: [PATCH 08/20] revise specification section --- SWIPs/swip-data-provenance.md | 58 +++++++++++++++++++++++++---------- 1 file changed, 41 insertions(+), 17 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index ee29cb5..8eb6440 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -62,37 +62,61 @@ This proposal aims to align with the Data Spaces Support Centre blueprint for a -### 1. Provenance Record Structure -The provenance file will be stored in JSON format with the following structure: +### 1. Provenance Record Structure +The provenance record will be stored as a single JSON file containing both metadata and the actual provenance data. 
The structure is: ```json { - "provenance_metadata_id": "UUID string", "content_hash": "sha256:9f86d...a9e", "provenance_standard": "DaTA v1.0.0", - "data_swarm_reference": "", + "encryption": "none", + "data": "", "stamp_id": "0xfe2f...c3a1" } ``` -*This structure allows a unique identification of the metadata itself (UUID), a reference to the actual provenance data via a Swarm hash, a declaration of the provenance standard used, and the stamp associated with the storage.* +**Key Fields**: +- `content_hash`: SHA-256 hash of the raw provenance data (before Base64 encoding) for integrity verification. +- `provenance_standard`: Declares the standard used (e.g., DaTA, W3C PROV, or custom). +- `encryption`: Optional field to indicate encryption method (default: `"none"`). +- `data`: Base64-encoded provenance data (actual content in any format). +- `stamp_id`: Swarm stamp ID for TTL management. -### 2. Toolkit Features -The toolkit provides functionalities to interact with the Swarm network: +*This structure ensures self-contained provenance records while maintaining compatibility with any standard. The `data` field can store provenance information in formats like JSON, XML, or binary.* -- **Provenance Metadata Upload**: Uploads the JSON metadata file to Swarm. -- **Provenance Data Upload**: Uploads the actual provenance data file to Swarm. -- **Provenance Metadata Access**: Retrieves the JSON metadata file using its Swarm reference hash. -- **Provenance Data Access**: Retrieves the actual provenance data file using its Swarm reference hash. -- **Check TTL**: Queries the stamp associated with the storage to determine the remaining storage duration for both the metadata and the data. -- **Stamp Top-Up**: Extends the storage period for the associated stamp (both the metadata and the data are extended). 
-- **Data Existence Check**: Verifies whether the data (provenance metadata and/or actual provenance data) exists on the Swarm network. -These features can utilize a Swarm gateway or a local Bee node, as per the user’s choice. +### 2. Toolkit Features +The toolkit interacts with Swarm to manage provenance records via a single JSON file: -### 3. Privacy Controls +- **Upload**: + - Action: Uploads the JSON file to Swarm. + - Workflow: + 1. User prepares provenance data in any format (e.g., DaTA spec, W3C PROV). + 2. Toolkit computes the SHA-256 hash of the raw data, Base64-encodes the raw data, and wraps both into the JSON structure. + 3. JSON file is uploaded to Swarm via Bee node or gateway. + - Returns: Swarm reference hash for the JSON file. + +- **Download**: + - Action: Retrieves the JSON file using its Swarm reference hash. + - Workflow: Toolkit fetches the JSON, decodes the Base64 `data` field, and verifies integrity via `content_hash`. + +- **Check TTL**: + - Action: Queries remaining storage validity for the JSON file. + - Workflow: Toolkit uses the `stamp_id` to check TTL via Swarm’s stamp management system. + +- **Top-Up**: + - Action: Extends storage validity for the JSON file. + - Workflow: Toolkit tops up the existing `stamp_id` (assumes user has pre-funded their Bee node). + - *Note: Acquiring funds (e.g., xBZZ) is out of scope for this toolkit.* + +- **Existence Check**: + - Action: Verifies if the JSON file exists on Swarm. + - Workflow: Toolkit checks Swarm for the reference hash. + +### 3. Privacy Controls +- **Optional Encryption**: Users may encrypt the raw provenance data before Base64 encoding. The `encryption` field can declare the method (e.g., `aes-256-gcm`), but key management is left to the user. +- **No AI Screening**: Privacy checks (e.g., PII detection) are deferred to future enhancements or third-party services. -*Privacy controls are optional at later stages and are left to the discretion of the user. 
The user can choose to encrypt the data before uploading it.* ## Rationale From 911a299f064af3a935ad42474aaf4c36d6ca9e10 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 11:37:34 +0100 Subject: [PATCH 09/20] rewrite rationale --- SWIPs/swip-data-provenance.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 8eb6440..de60599 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -121,13 +121,17 @@ The toolkit interacts with Swarm to manage provenance records via a single JSON ## Rationale -Swarm’s decentralized nature makes it ideal for acting as a trusted third party in provenance systems. Unlike public blockchains, it supports larger data sizes without compromising privacy when encryption is used. While standards like W3C PROV are comprehensive, they may be too complex for some use cases; simpler alternatives like the Data & Trust Alliance spec are more practical for initial implementations. This approach allows flexibility while ensuring compatibility with existing standards. -- Adopting Existing Standards: By relying on the simpler DaTA spec for most use cases (with the option for W3C PROV), the solution avoids over-complication while maintaining interoperability. -- Swarm’s Suitability: Its decentralized, immutable nature makes it an ideal candidate for storing provenance data, which benefits from being tamper-resistant and verifiable. -- AI Integration: Embedding AI within the upload flow can not only prevent privacy breaches but also assist in interpreting provenance data—a value-add that enhances user trust and usability. +The design of this SWIP centers on providing a flexible and future-proof solution for storing provenance data on Swarm. 
The decision to use a single JSON file structure simplifies both the management and retrieval processes, enabling easy integration with existing provenance standards without enforcing a specific one. This approach acknowledges that while standards like the Data & Trust Alliance (DaTA) specification and the W3C PROV standard exist, the market is still evolving, and imposing a single standard could limit adoption. + +By storing the actual provenance data as a Base64-encoded string within the JSON structure, we ensure that any file format can be accommodated. This maintains data integrity across different systems and transfer methods, providing users with the freedom to choose the most appropriate format for their specific use case. Optional encryption addresses potential GDPR and privacy concerns, giving users control over their data security while maintaining ease of use. + +To enhance the utility of the provenance data and provide Swarm-specific functionality, additional metadata fields are included in the JSON structure. The inclusion of the `stamp_id` enables users to easily check the storage duration of their provenance data and potentially extend it, aligning with Swarm's storage management mechanisms. Additionally, the `content_hash` provides a means to verify data integrity, particularly when the provenance data is stored in systems other than Swarm, allowing users to match the data across different storage locations. + +Swarm's stamp-based TTL management system aligns with the network's existing storage incentive mechanism, offering users a familiar way to control storage duration and cost. The toolkit approach simplifies the integration of Swarm storage for provenance use cases and allows for future extensions, such as AI agents for data validation, without modifying core functionality. 
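Reviewer note: the record design argued for in this rationale (a Base64 payload plus a SHA-256 `content_hash` over the raw bytes) can be illustrated with a minimal round-trip sketch. The function names `build_record` and `extract_verified_data` are hypothetical, and the sketch assumes the hash is taken over exactly the bytes that get Base64-encoded (so over the ciphertext if the user encrypted the data first). It performs no Swarm upload or download.

```python
import base64
import hashlib
import json


def build_record(raw: bytes, stamp_id: str, standard: str, encryption: str = "none") -> str:
    """Wrap raw provenance bytes into the single-file JSON record."""
    record = {
        # Hash of the raw bytes, computed before Base64 encoding
        "content_hash": "sha256:" + hashlib.sha256(raw).hexdigest(),
        "provenance_standard": standard,
        "encryption": encryption,
        # Base64 lets any file format travel inside a JSON string
        "data": base64.b64encode(raw).decode("ascii"),
        "stamp_id": stamp_id,
    }
    return json.dumps(record, indent=2)


def extract_verified_data(record_json: str) -> bytes:
    """Decode the `data` field and check it against `content_hash`."""
    record = json.loads(record_json)
    raw = base64.b64decode(record["data"], validate=True)
    algorithm, _, expected = record["content_hash"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    if hashlib.sha256(raw).hexdigest() != expected:
        raise ValueError("content_hash mismatch: data was altered or corrupted")
    return raw
```

Because the hash covers the decoded bytes rather than the JSON text, the same `content_hash` can also be compared against a copy of the provenance file kept outside Swarm, which is the cross-storage matching the rationale mentions.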
+ +Finally, leveraging Swarm's decentralized network as a trusted third party aligns with Data Spaces Support Centre specifications and offers a more scalable and potentially cost-effective solution compared to blockchain-based alternatives. This positions Swarm as a flexible and standards-compatible storage layer for provenance data, catering to emerging market needs while leveraging its unique features. -A comparative advantage over alternatives (e.g., centralized storage solutions or blockchain-based systems) is seen in Swarm’s cost-effectiveness and scalability, without compromising security or data integrity. ## Backwards Compatibility From a8af27b27dd4fda0ba03a4e2d8a336628d1b37ff Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 11:40:50 +0100 Subject: [PATCH 10/20] add test cases --- SWIPs/swip-data-provenance.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index de60599..6389faf 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -141,7 +141,22 @@ It operates on top of the existing Swarm infrastructure and adheres to establish ## Test Cases -Test cases for an implementation are mandatory for SWIPs that are affecting changes to data and message formats. Other SWIPs can choose to include links to test cases if applicable. + +## Test Cases + +Given that this is an informational SWIP, the test cases provided here are conceptual and aim to illustrate how the proposed system would work. + +1. **Provenance File Upload and Retrieval**: + - Scenario: A user uploads a JSON file containing provenance metadata and data to Swarm. + - Expected Result: The file is successfully stored on Swarm, and the Swarm reference hash is returned to the user. The user can then retrieve the file using the hash and verify that the content matches the original file. + +2. 
**TTL Check and Storage Extension**: + - Scenario: A user checks the remaining TTL for a provenance file stored on Swarm. + - Expected Result: The toolkit queries the Swarm network and returns the remaining TTL for the associated stamp. The user can then extend the storage duration by topping up the stamp. + +These test cases provide a high-level overview of the key functionalities of the proposed system and demonstrate its ability to store, retrieve, and manage provenance data on Swarm. + + ## Implementation From 7ea3126eaca9ad4a501454db28c938105d21d06b Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 11:46:57 +0100 Subject: [PATCH 11/20] add implementation section --- SWIPs/swip-data-provenance.md | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 6389faf..cd4ab71 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -160,7 +160,32 @@ These test cases provide a high-level overview of the key functionalities of the ## Implementation -A prototype toolkit will be developed as part of the fellowship deliverables. This toolkit will include features for uploading, retrieving, validating, and extending storage of provenance data on Swarm. + +## Implementation + +A prototype toolkit is being developed under the DataFund Fellowship with the following components: + +- **Core Functionality**: + - The toolkit will be implemented in Python. + - It will provide command-line access to Swarm for uploading, downloading, and managing provenance files. + - It will support the JSON-based provenance record structure as outlined in the Specification section. + - It will use the Bee client library to interact with the Swarm network. 
+ +- **Key Components**: + - **Upload Module**: Handles the upload of provenance data to Swarm, including the preparation of the JSON metadata file and the selection of appropriate storage options (e.g., encryption). + - **Download Module**: Retrieves provenance data from Swarm, verifies data integrity using the content hash, and presents the data to the user. + - **TTL Management Module**: Provides functionality to check the remaining storage duration (TTL) for a provenance file and extend the storage by topping up the associated stamp. + +- **Integration Points**: + - The toolkit will interact with a Swarm gateway or a local Bee node, as specified by the user. + +- **Future Considerations**: + - The toolkit may incorporate support for additional provenance standards and encryption methods. + - It may also include features for managing complex provenance chains, as outlined in the research document. + - Integration with external services, such as AI agents for data validation and interpretation, and services for attestation and notarization, may be explored in future versions of the toolkit. + +This implementation aims to provide a practical and user-friendly solution for storing and managing provenance data on Swarm, while also laying the groundwork for future extensions and integrations. + ## Copyright Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). From c0a4bbaf8a3755f2b6572b9fd93b020cb148c0e1 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 11:53:22 +0100 Subject: [PATCH 12/20] update summary and abstract --- SWIPs/swip-data-provenance.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index cd4ab71..3baeeda 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -21,28 +21,27 @@ The title should be 44 characters or less. 
## Simple Summary -This SWIP proposes Swarm as a decentralized storage layer for provenance metadata and data. Provenance, the documented history of a dataset's origin and transformations, is increasingly important for regulatory compliance, ethical AI, and data accountability. A toolkit will provide utilities to: + +This SWIP proposes Swarm as a decentralized storage layer for provenance metadata and data. Provenance, the documented history of a dataset's origin and transformations, is increasingly important for regulatory compliance, ethical AI, and data accountability. A command-line toolkit will provide utilities to: - Upload/Download: Store and retrieve provenance files (in any format) with Swarm reference hashes. - Metadata Management: Track storage validity (TTL) and extend it via stamp top-ups. -- Data Integrity: Verify content through SHA-256 hashes. The framework does not enforce specific provenance standards but ensures compatibility by decoupling metadata (structured JSON) from the actual provenance data (stored as arbitrary files). Developers and enterprises retain full control over their data format and privacy measures. + ## Abstract Provenance systems require immutable, scalable storage to track data lineage effectively. This SWIP leverages Swarm’s decentralized network to: - Store Provenance Data: Users upload files in any format (e.g., W3C PROV-JSONLD, DaTA spec, or custom schemas). 
- Manage Metadata: A JSON wrapper includes: - - `provenance_metadata_id` (UUID for unique identification) - - `data_swarm_reference` (Swarm hash pointing to the provenance file) - - `stamp_id` (for TTL tracking and renewal) - - `content_hash` (SHA-256 for integrity checks) + - `content_hash` (SHA-256 for matching provenance data across different systems) - `provenance_standard` (optional field for self-declared standards) + - `data` (Base64-encoded provenance data) + - `stamp_id` (for TTL tracking and renewal) - Ensure Flexibility: No Swarm-level validation of provenance formats—compatibility is achieved by design. -A prototype toolkit (developed under the DataFund Fellowship) will provide CLI and API access to Swarm, enabling integration into existing workflows. Privacy and encryption remain optional, allowing users to comply with regulations like GDPR independently. - +A prototype toolkit (developed under the DataFund Fellowship) will provide command-line access to Swarm, enabling integration into existing workflows. Privacy and encryption remain optional, allowing users to comply with regulations like GDPR independently. ## Motivation From 9bf80a9e92741c198a431e3b471fa8b7a001d69d Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 12:16:46 +0100 Subject: [PATCH 13/20] add reference to diagram --- SWIPs/swip-data-provenance.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 3baeeda..5aace30 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -131,6 +131,19 @@ Swarm's stamp-based TTL management system aligns with the network's existing sto Finally, leveraging Swarm's decentralized network as a trusted third party aligns with Data Spaces Support Centre specifications and offers a more scalable and potentially cost-effective solution compared to blockchain-based alternatives. 
This positions Swarm as a flexible and standards-compatible storage layer for provenance data, catering to emerging market needs while leveraging its unique features. +TODO Add diagram + +**Diagram Description:** + +The diagram illustrates Swarm acting as a trusted third party for storing Provenance and Traceability (P&T) data. It consists of three main entities: + +1. **Consumer:** A data consumer, labeled "Consumer". +2. **Provider:** A data provider, labeled "Provider". +3. **VAS Provider:** Represented as the central storage element, labeled "Swarm". + +Both the Consumer and the Provider connect to the Swarm Network. The Swarm Network stores P&T data and takes the role of "Value Added Service (VAS) Provider", allowing other services or tooling to be built on top of it. The VAS Provider element contains the label "P&T" to indicate that it provides provenance data. + +This positions Swarm as a flexible and standards-compatible storage layer for provenance data, catering to emerging market needs while leveraging its unique features. 
## Backwards Compatibility From c99350748fe062495e85898d6cfd913c3b70e4d3 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 15:52:45 +0100 Subject: [PATCH 14/20] add didagram --- SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg | 1 + SWIPs/swip-data-provenance.md | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) create mode 100644 SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg diff --git a/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg b/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg new file mode 100644 index 0000000..478eacd --- /dev/null +++ b/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 5aace30..0897a3d 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -131,7 +131,7 @@ Swarm's stamp-based TTL management system aligns with the network's existing sto Finally, leveraging Swarm's decentralized network as a trusted third party aligns with Data Spaces Support Centre specifications and offers a more scalable and potentially cost-effective solution compared to blockchain-based alternatives. This positions Swarm as a flexible and standards-compatible storage layer for provenance data, catering to emerging market needs while leveraging its unique features. 
-TODO Add diagram +![Image](assets/swip-x-provenance/Provenance-diagram1.svg "Swarm as trusted third party storage - diagram.") **Diagram Description:** From 09aa2bb3bca661eb75efeefcc722d1a01b1a8188 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Wed, 12 Mar 2025 15:54:39 +0100 Subject: [PATCH 15/20] remove duplicates --- SWIPs/swip-data-provenance.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 0897a3d..0d6837a 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -154,8 +154,6 @@ It operates on top of the existing Swarm infrastructure and adheres to establish ## Test Cases -## Test Cases - Given that this is an informational SWIP, the test cases provided here are conceptual and aim to illustrate how the proposed system would work. 1. **Provenance File Upload and Retrieval**: @@ -169,12 +167,9 @@ Given that this is an informational SWIP, the test cases provided here are conce These test cases provide a high-level overview of the key functionalities of the proposed system and demonstrate its ability to store, retrieve, and manage provenance data on Swarm. 
- ## Implementation -## Implementation - A prototype toolkit is being developed under the DataFund Fellowship with the following components: - **Core Functionality**: From d6511dde9ac8852dde7ffcefc7e66e85407e7e29 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Thu, 13 Mar 2025 15:03:34 +0100 Subject: [PATCH 16/20] shorten title, change date --- SWIPs/swip-data-provenance.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/swip-data-provenance.md index 0d6837a..b3eabee 100644 --- a/SWIPs/swip-data-provenance.md +++ b/SWIPs/swip-data-provenance.md @@ -1,12 +1,12 @@ --- SWIP: -title: Swarm-Based Provenance Framework for Data Accountability and Trust +title: Swarm-Based Data Provenance Framework author: Črt Ahlin (@crtahlin) discussions-to: https://discord.com/channels/799027393297514537/1239813439136993280 status: WIP type: Informational -created: 2025-03-06 +created: 2025-03-13 requires (*optional): replaces (*optional): --- From 4faa6df7c2050fda7e54f162bbe8bd0df7773557 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Thu, 13 Mar 2025 15:07:26 +0100 Subject: [PATCH 17/20] rename file as per instructions --- SWIPs/{swip-data-provenance.md => SWIP-draft_Data_Provenance.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename SWIPs/{swip-data-provenance.md => SWIP-draft_Data_Provenance.md} (100%) diff --git a/SWIPs/swip-data-provenance.md b/SWIPs/SWIP-draft_Data_Provenance.md similarity index 100% rename from SWIPs/swip-data-provenance.md rename to SWIPs/SWIP-draft_Data_Provenance.md From e8f1159f651adcccd1e3283c93783092c5d9ad82 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Thu, 13 Mar 2025 15:22:44 +0100 Subject: [PATCH 18/20] clean up, minor changes --- SWIPs/SWIP-draft_Data_Provenance.md | 20 +++----------------- 1 file changed, 3 insertions(+), 17 deletions(-) diff --git a/SWIPs/SWIP-draft_Data_Provenance.md b/SWIPs/SWIP-draft_Data_Provenance.md index b3eabee..7ef0002 100644 --- 
a/SWIPs/SWIP-draft_Data_Provenance.md +++ b/SWIPs/SWIP-draft_Data_Provenance.md @@ -12,23 +12,16 @@ replaces (*optional): --- -This is the suggested template for new SWIPs. - -Note that a SWIP number will be assigned by an editor. When opening a pull request to submit your SWIP, please use an abbreviated title in the filename, `SWIP-draft_title_abbrev.md`. - -The title should be 44 characters or less. ## Simple Summary - This SWIP proposes Swarm as a decentralized storage layer for provenance metadata and data. Provenance, the documented history of a dataset's origin and transformations, is increasingly important for regulatory compliance, ethical AI, and data accountability. A command-line toolkit will provide utilities to: - Upload/Download: Store and retrieve provenance files (in any format) with Swarm reference hashes. - Metadata Management: Track storage validity (TTL) and extend it via stamp top-ups. The framework does not enforce specific provenance standards but ensures compatibility by decoupling metadata (structured JSON) from the actual provenance data (stored as arbitrary files). Developers and enterprises retain full control over their data format and privacy measures. - ## Abstract @@ -43,9 +36,9 @@ Provenance systems require immutable, scalable storage to track data lineage eff A prototype toolkit (developed under the DataFund Fellowship) will provide command-line access to Swarm, enabling integration into existing workflows. Privacy and encryption remain optional, allowing users to comply with regulations like GDPR independently. - ## Motivation + Data provenance is critical for ethical AI development, regulatory compliance, and ensuring trust in data-driven systems. Existing provenance solutions often face challenges in terms of vendor lock-in, scalability, and privacy. Centralized systems create dependencies and potential single points of failure. 
Public blockchains, while immutable, can be costly for large-scale data storage and may not adequately address privacy concerns. Swarm offers a compelling alternative by acting as a trustless, decentralized storage network: @@ -56,11 +49,9 @@ Swarm offers a compelling alternative by acting as a trustless, decentralized st This proposal aims to align with the Data Spaces Support Centre blueprint for a Technical Building Block covering Provenance & Traceability. By enabling easy uploading, downloading, and management of provenance files, the toolkit empowers users to meet emerging regulatory requirements, such as those outlined in the EU AI Act, and to establish trust and accountability in data-driven systems. - ## Specification - ### 1. Provenance Record Structure The provenance record will be stored as a single JSON file containing both metadata and the actual provenance data. The structure is: @@ -83,7 +74,6 @@ The provenance record will be stored as a single JSON file containing both metad *This structure ensures self-contained provenance records while maintaining compatibility with any standard. The `data` field can store provenance information in formats like JSON, XML, or binary.* - ### 2. Toolkit Features The toolkit interacts with Swarm to manage provenance records via a single JSON file: @@ -116,8 +106,6 @@ The toolkit interacts with Swarm to manage provenance records via a single JSON - **Optional Encryption**: Users may encrypt the raw provenance data before Base64 encoding. The `encryption` field can declare the method (e.g., `aes-256-gcm`), but key management is left to the user. - **No AI Screening**: Privacy checks (e.g., PII detection) are deferred to future enhancements or third-party services. - - ## Rationale @@ -147,10 +135,10 @@ This positions Swarm as a flexible and standards-compatible storage layer for pr ## Backwards Compatibility + This proposal does not introduce changes to Swarm’s core functionality or protocols. 
It leverages existing capabilities such as immutable storage and reference hashes, ensuring full compatibility with current implementations. It operates on top of the existing Swarm infrastructure and adheres to established file storage and retrieval methods. All existing Swarm tools (like the Bee CLI and Dashboard) remain fully compatible with this additional use case. - ## Test Cases @@ -166,7 +154,6 @@ Given that this is an informational SWIP, the test cases provided here are conce These test cases provide a high-level overview of the key functionalities of the proposed system and demonstrate its ability to store, retrieve, and manage provenance data on Swarm. - ## Implementation @@ -176,7 +163,7 @@ A prototype toolkit is being developed under the DataFund Fellowship with the fo - The toolkit will be implemented in Python. - It will provide command-line access to Swarm for uploading, downloading, and managing provenance files. - It will support the JSON-based provenance record structure as outlined in the Specification section. - - It will use the Bee client library to interact with the Swarm network. + - It will use the Bee client to interact with the Swarm network. - **Key Components**: - **Upload Module**: Handles the upload of provenance data to Swarm, including the preparation of the JSON metadata file and the selection of appropriate storage options (e.g., encryption). @@ -193,6 +180,5 @@ A prototype toolkit is being developed under the DataFund Fellowship with the fo This implementation aims to provide a practical and user-friendly solution for storing and managing provenance data on Swarm, while also laying the groundwork for future extensions and integrations. - ## Copyright Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). 
From 7050d29477bd23dad14003dc94e78d7c7e56a05e Mon Sep 17 00:00:00 2001 From: crtahlin Date: Thu, 13 Mar 2025 15:34:41 +0100 Subject: [PATCH 19/20] change color on diagram --- SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg b/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg index 478eacd..e7b318a 100644 --- a/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg +++ b/SWIPs/assets/swip-x-provenance/Provenance-diagram1.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file From 64713eb243d18be4362c1f687dfc88dcfe1b6b68 Mon Sep 17 00:00:00 2001 From: crtahlin Date: Fri, 14 Mar 2025 15:32:17 +0100 Subject: [PATCH 20/20] add stamp management --- SWIPs/SWIP-draft_Data_Provenance.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/SWIPs/SWIP-draft_Data_Provenance.md b/SWIPs/SWIP-draft_Data_Provenance.md index 7ef0002..a9e7562 100644 --- a/SWIPs/SWIP-draft_Data_Provenance.md +++ b/SWIPs/SWIP-draft_Data_Provenance.md @@ -70,7 +70,7 @@ The provenance record will be stored as a single JSON file containing both metad - `provenance_standard`: Declares the standard used (e.g., DaTA, W3C PROV, or custom). - `encryption`: Optional field to indicate encryption method (default: `"none"`). - `data`: Base64-encoded provenance data (actual content in any format). -- `stamp_id`: Swarm stamp ID for TTL management. +- `stamp_id`: Swarm stamp ID used for TTL management, purchased based on the size of the JSON file and desired storage duration. *This structure ensures self-contained provenance records while maintaining compatibility with any standard. The `data` field can store provenance information in formats like JSON, XML, or binary.* @@ -81,9 +81,14 @@ The toolkit interacts with Swarm to manage provenance records via a single JSON - Action: Uploads the JSON file to Swarm. - Workflow: 1. 
User prepares provenance data in any format (e.g., DaTA spec, W3C PROV). - 2. Toolkit generates SHA-256 hash of raw data, encodes it to Base64, and wraps it into the JSON structure. - 3. JSON file is uploaded to Swarm via Bee node or gateway. + 2. Toolkit generates a SHA-256 hash of the raw data, encodes the raw data to Base64, and wraps both into the JSON structure. + 3. The toolkit calculates the size of the JSON file to determine the required stamp size. + 4. The user specifies a desired TTL for storage, or a default value is used. + 5. The toolkit purchases a stamp from the Bee node or gateway using available funds. + 6. The returned stamp ID is used to upload the JSON file to Swarm. + 7. JSON file is uploaded to Swarm via Bee node or gateway. - Returns: Swarm reference hash for the JSON file. + - *Note: Acquiring funds (e.g., xBZZ for stamp purchase) is out of scope for this toolkit; it is assumed the node has the funds available.* - **Download**: - Action: Retrieves the JSON file using its Swarm reference hash. @@ -162,11 +167,12 @@ A prototype toolkit is being developed under the DataFund Fellowship with the fo - **Core Functionality**: - The toolkit will be implemented in Python. - It will provide command-line access to Swarm for uploading, downloading, and managing provenance files. + - It will calculate the size of the JSON file and purchase stamps based on user-specified or default TTL values. - It will support the JSON-based provenance record structure as outlined in the Specification section. - It will use the Bee client to interact with the Swarm network. - **Key Components**: - - **Upload Module**: Handles the upload of provenance data to Swarm, including the preparation of the JSON metadata file and the selection of appropriate storage options (e.g., encryption). + - **Upload Module**: Handles the upload of provenance data to Swarm, including the preparation of the JSON metadata file and the selection of appropriate storage options (e.g., a suitable stamp).
- **Download Module**: Retrieves provenance data from Swarm, verifies data integrity using the content hash, and presents the data to the user. - **TTL Management Module**: Provides functionality to check the remaining storage duration (TTL) for a provenance file and extend the storage by topping up the associated stamp.
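
The record preparation in the upload workflow (hash, Base64-encode, wrap) and the integrity check performed by the Download Module could be sketched roughly as below. This is a minimal illustration and not the toolkit's actual code. The field names come from the Specification section; the function names, the hex encoding of `content_hash`, and the 64-character placeholder stamp ID in the usage example are assumptions made for this example. Stamp purchase and the upload itself are left out because they depend on the Bee node or gateway in use.

```python
import base64
import hashlib
import json


def build_provenance_record(raw: bytes, standard: str, stamp_id: str,
                            encryption: str = "none") -> dict:
    """Wrap raw provenance bytes into the JSON record structure (hypothetical helper)."""
    return {
        # Assumption: the SHA-256 digest of the raw data is stored as a hex string.
        "content_hash": hashlib.sha256(raw).hexdigest(),
        "provenance_standard": standard,
        "encryption": encryption,
        # The raw data (not the hash) is Base64-encoded into the `data` field.
        "data": base64.b64encode(raw).decode("ascii"),
        "stamp_id": stamp_id,
    }


def record_size_bytes(record: dict) -> int:
    """Size of the serialized JSON file, as an input for choosing a stamp."""
    return len(json.dumps(record).encode("utf-8"))


def verify_provenance_record(record: dict) -> bytes:
    """Decode `data`, check it against `content_hash`, and return the raw bytes."""
    raw = base64.b64decode(record["data"])
    if hashlib.sha256(raw).hexdigest() != record["content_hash"]:
        raise ValueError("content_hash does not match the decoded data")
    return raw
```

A possible usage, with a placeholder stamp ID standing in for the one returned by the stamp purchase:

```python
raw = b'{"entity": "dataset-1", "activity": "cleaning"}'
record = build_provenance_record(raw, "custom", "0" * 64)
print(record_size_bytes(record))          # size used when choosing a stamp
assert verify_provenance_record(record) == raw
```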