22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,22 @@
# Changelog

All notable changes to this project will be documented in this file.

## [0.15] - 2025-12-18
- Prepare new PyPI release 0.15 (skipping 0.13/0.14 as requested).
- Improve Vault authentication: host-restricted token exchange and clearer errors.
- Add tests for Vault auth behavior.
- Add docstrings to increase docstring coverage for CI.

Note: After merging this branch, publish a PyPI release (version 0.15) so
`pip install databusclient` reflects the updated CLI behavior and bug fixes.
# Changelog

## 0.15 - Prepared release

- Prepare PyPI release 0.15.
- Restrict Vault token exchange to known hosts and provide clearer auth errors.
- Add tests for Vault auth behavior.
- Documentation: note about Vault-hosts and `--vault-token` usage.

(See PR and issue tracker for details.)
Comment on lines +13 to +22

⚠️ Potential issue | 🟡 Minor

Remove duplicate changelog section.

Lines 13-22 duplicate the header and 0.15 release notes from lines 1-12. This appears to be accidental content duplication.

🔎 Apply this diff to remove the duplicate content:
 Note: After merging this branch, publish a PyPI release (version 0.15) so
 `pip install databusclient` reflects the updated CLI behavior and bug fixes.
-# Changelog
-
-## 0.15 - Prepared release
-
-- Prepare PyPI release 0.15.
-- Restrict Vault token exchange to known hosts and provide clearer auth errors.
-- Add tests for Vault auth behavior.
-- Documentation: note about Vault-hosts and `--vault-token` usage.
-
-(See PR and issue tracker for details.)
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

13-13: Multiple headings with the same content

(MD024, no-duplicate-heading)

🤖 Prompt for AI Agents
In CHANGELOG.md around lines 13 to 22, the block repeating the header and "##
0.15 - Prepared release" release notes is a duplicate of lines 1-12; remove the
redundant lines 13–22 so the changelog contains only a single copy of the 0.15
entry. Ensure no other content or spacing is altered beyond deleting that
duplicate block and that the file ends with a single consolidated changelog
section.
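
The duplicate-heading finding above (markdownlint MD024) can be approximated in a few lines of Python. This is only a rough sketch: `duplicate_headings` is a hypothetical helper, not part of markdownlint or of this repository, and it ignores fenced code blocks and Setext-style headings.

```python
from collections import Counter


def duplicate_headings(markdown_text: str) -> list[str]:
    """Return ATX heading lines that appear more than once, in first-seen order."""
    headings = [
        line.strip()
        for line in markdown_text.splitlines()
        if line.lstrip().startswith("#")
    ]
    counts = Counter(headings)
    return [heading for heading, count in counts.items() if count > 1]


# Shortened excerpt of the changelog under review (illustrative only).
changelog = """# Changelog

## [0.15] - 2025-12-18
- Prepare new PyPI release 0.15.
# Changelog

## 0.15 - Prepared release
"""
print(duplicate_headings(changelog))
```

A check like this could run in CI before the changelog is merged, although the real markdownlint rule covers more cases than this sketch.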

8 changes: 8 additions & 0 deletions README.md
@@ -41,6 +41,12 @@ Before using the client, install it via pip:
python3 -m pip install databusclient
```

Note: this repository prepares version `0.15` for release on PyPI. If you installed `databusclient` via `pip` earlier and observe CLI behavior that differs from this documentation, upgrade to `0.15`:

```bash
python3 -m pip install --upgrade databusclient==0.15
```
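
To check which version is actually installed before or after upgrading, the standard library's `importlib.metadata` can be queried. This is a minimal sketch; `installed_version` is a hypothetical helper name and not part of `databusclient`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of ``package``, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("databusclient"))
```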

You can then use the client in the command line:

```bash
@@ -164,6 +170,8 @@ docker run --rm -v $(pwd):/data dbpedia/databus-python-client download $DOWNLOAD
- If no `--localdir` is provided, the current working directory is used as base directory. The downloaded files will be stored in the working directory in a folder structure according to the Databus layout, i.e. `./$ACCOUNT/$GROUP/$ARTIFACT/$VERSION/`.
- `--vault-token`
- If the dataset/files to be downloaded require vault authentication, you need to provide a vault token with `--vault-token /path/to/vault-token.dat`. See [Registration (Access Token)](#registration-access-token) for details on how to get a vault token.

Note: Vault tokens are required only for certain protected Databus hosts (for example `data.dbpedia.io` and `data.dev.dbpedia.link`). The client detects these hosts and fails early with a clear message if a token is required but none was provided. Do not pass `--vault-token` for public downloads.
- `--databus-key`
- If the databus is protected and needs API key authentication, you can provide the API key with `--databus-key YOUR_API_KEY`.

Expand Down
13 changes: 13 additions & 0 deletions databusclient/__init__.py
@@ -1,8 +1,21 @@
"""Top-level package for the databus Python client.

This module exposes a small set of convenience functions and the CLI
entrypoint so the package can be used as a library or via
``python -m databusclient``.
"""

from databusclient import cli
from databusclient.api.deploy import create_dataset, create_distribution, deploy

__all__ = ["create_dataset", "deploy", "create_distribution"]


def run():
"""Start the Click CLI application.

This function is used by the ``__main__`` module and the package
entrypoint to invoke the command line interface.
"""

cli.app()
18 changes: 17 additions & 1 deletion databusclient/__main__.py
@@ -1,3 +1,19 @@
"""Module used for ``python -m databusclient`` execution.

Runs the package's CLI application.
"""

from databusclient import cli

-cli.app()

def main():
"""Invoke the CLI application.

Kept as a named function for easier testing and clarity.
"""

cli.app()


if __name__ == "__main__":
main()
27 changes: 27 additions & 0 deletions databusclient/api/delete.py
@@ -1,3 +1,10 @@
"""Helpers for deleting Databus resources via the Databus HTTP API.

This module provides utilities to delete groups, artifacts and versions on a
Databus instance using authenticated HTTP requests. The class `DeleteQueue`
also allows batching of deletions.
"""

import json
from typing import List

@@ -16,23 +23,43 @@ class DeleteQueue:
"""

def __init__(self, databus_key: str):
"""Create a DeleteQueue bound to a given Databus API key.

Args:
databus_key: API key used to authenticate deletion requests.
"""
self.databus_key = databus_key
self.queue: set[str] = set()

def add_uri(self, databusURI: str):
"""Add a single Databus URI to the deletion queue.

The URI will be deleted when `execute()` is called.
"""
self.queue.add(databusURI)

def add_uris(self, databusURIs: List[str]):
"""Add multiple Databus URIs to the deletion queue.

Args:
databusURIs: Iterable of full Databus URIs.
"""
for uri in databusURIs:
self.queue.add(uri)

def is_empty(self) -> bool:
"""Return True if the queue is empty."""
return len(self.queue) == 0

def is_not_empty(self) -> bool:
"""Return True if the queue contains any URIs."""
return len(self.queue) > 0

def execute(self):
"""Execute all queued deletions.

Each queued URI will be deleted using `_delete_resource`.
"""
for uri in self.queue:
print(f"[DELETE] {uri}")
_delete_resource(
Expand Down
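
Because `self.queue` is a `set`, adding the same URI twice queues it only once, and the iteration order in `execute()` is not guaranteed. The stand-in below illustrates this. `SketchDeleteQueue` and its `delete_fn` parameter are hypothetical; they replace the real `_delete_resource` call, whose arguments are not shown in this diff, and the example URIs are made up.

```python
from typing import Callable, Iterable, List


class SketchDeleteQueue:
    """Stand-in mirroring the queue methods shown above (no HTTP involved)."""

    def __init__(self, delete_fn: Callable[[str], None]):
        self.delete_fn = delete_fn  # hypothetical replacement for _delete_resource
        self.queue: set[str] = set()

    def add_uris(self, uris: Iterable[str]) -> None:
        self.queue.update(uris)

    def execute(self) -> None:
        for uri in self.queue:
            self.delete_fn(uri)


deleted: List[str] = []
queue = SketchDeleteQueue(deleted.append)
queue.add_uris(
    [
        "https://databus.example.org/acct/group/artifact/1.0",
        "https://databus.example.org/acct/group/artifact/1.0",  # duplicate entry
        "https://databus.example.org/acct/group/artifact/2.0",
    ]
)
queue.execute()
print(sorted(deleted))  # sorted, because set iteration order is arbitrary
```

If deletion order ever matters (for example versions before their artifact), a list with explicit de-duplication might be a better fit than a set.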
42 changes: 42 additions & 0 deletions databusclient/api/deploy.py
@@ -1,3 +1,10 @@
"""Build and publish Databus datasets (JSON-LD) from provided metadata.

This module exposes helpers to create distribution strings, compute file
information (sha256 and size), construct dataset JSON-LD payloads and
publish them to a Databus instance using the Databus publish API.
"""

import hashlib
import json
from enum import Enum
@@ -25,6 +32,13 @@ class DeployLogLevel(Enum):


def _get_content_variants(distribution_str: str) -> Optional[Dict[str, str]]:
"""Parse content-variant key/value pairs from a distribution string.

The CLI supports passing a distribution as ``url|lang=en_type=parsed|...``.
This helper extracts the ``lang``/``type`` style key/value pairs as a
dictionary.
"""

args = distribution_str.split("|")

# cv string is ALWAYS at position 1 after the URL
@@ -50,6 +64,12 @@ def _get_content_variants(distribution_str: str) -> Optional[Dict[str, str]]:
def _get_filetype_definition(
distribution_str: str,
) -> Tuple[Optional[str], Optional[str]]:
"""Extract an explicit file format and compression from a distribution string.

Returns (file_extension, compression) where each may be ``None`` if the
format should be inferred from the URL path.
"""

file_ext = None
compression = None

@@ -87,6 +107,12 @@ def _get_filetype_definition(


def _get_extensions(distribution_str: str) -> Tuple[str, str, str]:
"""Return tuple `(extension_part, format_extension, compression)`.

``extension_part`` is the textual extension appended to generated
filenames (e.g. ".ttl.gz").
"""

extension_part = ""
format_extension, compression = _get_filetype_definition(distribution_str)

@@ -126,6 +152,11 @@ def _get_extensions(distribution_str: str) -> Tuple[str, str, str]:


def _get_file_stats(distribution_str: str) -> Tuple[Optional[str], Optional[int]]:
"""Parse an optional ``sha256sum:length`` tuple from a distribution string.

Returns (sha256sum, content_length) or (None, None) when not provided.
"""

metadata_list = distribution_str.split("|")[1:]
# check whether there is the shasum:length tuple separated by :
if len(metadata_list) == 0 or ":" not in metadata_list[-1]:
@@ -146,6 +177,12 @@ def _get_file_stats(distribution_str: str) -> Tuple[Optional[str], Optional[int]


def _load_file_stats(url: str) -> Tuple[str, int]:
"""Download the file at ``url`` and compute its SHA-256 and length.

This is used as a fallback when the caller did not supply checksum/size
information in the CLI or metadata file.
"""

resp = requests.get(url, timeout=30)
if resp.status_code >= 400:
raise requests.exceptions.RequestException(response=resp)
@@ -156,6 +193,11 @@ def _load_file_stats(url: str) -> Tuple[str, int]:


def get_file_info(distribution_str: str) -> Tuple[Dict[str, str], str, str, str, int]:
"""Return parsed file information for a distribution string.

Returns a tuple `(cvs, format_extension, compression, sha256sum, size)`.
"""

cvs = _get_content_variants(distribution_str)
extension_part, format_extension, compression = _get_extensions(distribution_str)

Expand Down
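
The docstrings above describe a distribution string of the form ``url|lang=en_type=parsed|...`` with an optional trailing ``sha256sum:length`` field. The sketch below shows how such a string could be taken apart, going only by those docstrings. The function names are hypothetical, the real helpers in `deploy.py` may differ in detail, and content-variant values containing `_` are not handled.

```python
import hashlib
from typing import Dict, Optional, Tuple


def parse_content_variants(distribution_str: str) -> Dict[str, str]:
    """Read ``key=value`` pairs (joined by ``_``) from the field after the URL."""
    fields = distribution_str.split("|")
    if len(fields) < 2 or "=" not in fields[1]:
        return {}
    pairs = (item.split("=", 1) for item in fields[1].split("_") if "=" in item)
    return {key: value for key, value in pairs}


def parse_file_stats(distribution_str: str) -> Tuple[Optional[str], Optional[int]]:
    """Read a trailing ``sha256sum:length`` field, if one is present."""
    metadata = distribution_str.split("|")[1:]  # skip the URL, which contains ":"
    if not metadata or ":" not in metadata[-1]:
        return None, None
    sha256sum, length = metadata[-1].split(":", 1)
    return sha256sum, int(length)


def sha256_and_length(content: bytes) -> Tuple[str, int]:
    """Compute the fallback values from raw bytes instead of a download."""
    return hashlib.sha256(content).hexdigest(), len(content)


# Example input; the URL is a placeholder.
dist = "https://example.org/data.ttl.gz|lang=en_type=parsed|ttl|gz"
print(parse_content_variants(dist))
print(parse_file_stats(dist))
print(sha256_and_length(b"abc"))
```

Supplying ``sha256sum:length`` in the distribution string avoids the extra download that the fallback in `_load_file_stats` performs.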
78 changes: 65 additions & 13 deletions databusclient/api/download.py
@@ -1,6 +1,7 @@
import json
import os
from typing import List
from urllib.parse import urlparse

import requests
from SPARQLWrapper import JSON, SPARQLWrapper
@@ -12,6 +13,18 @@
)


# Hosts that require Vault token based authentication. Central source of truth.
VAULT_REQUIRED_HOSTS = {
"data.dbpedia.io",
"data.dev.dbpedia.link",
}
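
A membership test against this set might look like the sketch below. `requires_vault_token` is a hypothetical helper, not a function from this PR. It relies on `urlparse(...).hostname`, which lowercases the host and drops any port, so differently cased or port-qualified URLs still match.

```python
from urllib.parse import urlparse

# Copied from the diff above so the example is self-contained.
VAULT_REQUIRED_HOSTS = {
    "data.dbpedia.io",
    "data.dev.dbpedia.link",
}


def requires_vault_token(url: str) -> bool:
    """Return True if the URL's host is listed in VAULT_REQUIRED_HOSTS."""
    host = urlparse(url).hostname  # None for URLs without a network location
    return host in VAULT_REQUIRED_HOSTS


print(requires_vault_token("https://DATA.dbpedia.io:443/some/file.ttl"))
print(requires_vault_token("https://databus.dbpedia.org/some/file.ttl"))
```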


class DownloadAuthError(Exception):
"""Raised when an authorization problem occurs during download."""



def _download_file(
url,
localDir,
@@ -52,13 +65,23 @@ def _download_file(
os.makedirs(dirpath, exist_ok=True) # Create the necessary directories
# --- 1. Get redirect URL by requesting HEAD ---
headers = {}

# Determine hostname early and fail fast if this host requires Vault token.
# This prevents confusing 401/403 errors later and tells the user exactly
# what to do (provide --vault-token).
parsed = urlparse(url)
host = parsed.hostname
if host in VAULT_REQUIRED_HOSTS and not vault_token_file:
raise DownloadAuthError(
f"Vault token required for host '{host}', but no token was provided. Please use --vault-token."
)

# --- 1a. public databus ---
response = requests.head(url, timeout=30)
# --- 1b. Databus API key required ---
if response.status_code == 401:
# print(f"API key required for {url}")
if not databus_key:
-raise ValueError("Databus API key not given for protected download")
+raise DownloadAuthError("Databus API key not given for protected download")

headers = {"X-API-KEY": databus_key}
response = requests.head(url, headers=headers, timeout=30)
@@ -81,25 +104,54 @@ def _download_file(
response = requests.get(
url, headers=headers, stream=True, allow_redirects=True, timeout=30
)
-www = response.headers.get(
-"WWW-Authenticate", ""
-) # Check if authentication is required
+www = response.headers.get("WWW-Authenticate", "") # Check if authentication is required

-# --- 3. If redirected to authentication 401 Unauthorized, get Vault token and retry ---
+# --- 3. Handle authentication responses ---
# 3a. Server requests Bearer auth. Only attempt token exchange for hosts
# we explicitly consider Vault-protected (VAULT_REQUIRED_HOSTS). This avoids
# sending tokens to unrelated hosts and makes auth behavior predictable.
if response.status_code == 401 and "bearer" in www.lower():
print(f"Authentication required for {url}")
-if not (vault_token_file):
-raise ValueError("Vault token file not given for protected download")
# If host is not configured for Vault, do not attempt token exchange.
if host not in VAULT_REQUIRED_HOSTS:
raise DownloadAuthError(
"Server requests Bearer authentication but this host is not configured for Vault token exchange."
" Try providing a databus API key with --databus-key or contact your administrator."
)

# Host requires Vault; ensure token file provided.
if not vault_token_file:
raise DownloadAuthError(
f"Vault token required for host '{host}', but no token was provided. Please use --vault-token."
)

# --- 3a. Fetch Vault token ---
# TODO: cache token
# --- 3b. Fetch Vault token and retry ---
# Token exchange is potentially sensitive and should only be performed
# for known hosts. __get_vault_access__ handles reading the refresh
# token and exchanging it; errors are translated to DownloadAuthError
# for user-friendly CLI output.
vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
headers["Authorization"] = f"Bearer {vault_token}"
-headers.pop("Accept-Encoding")
+headers.pop("Accept-Encoding", None)

-# --- 3b. Retry with token ---
+# Retry with token
response = requests.get(url, headers=headers, stream=True, timeout=30)

# Map common auth failures to friendly messages
if response.status_code == 401:
raise DownloadAuthError("Vault token is invalid or expired. Please generate a new token.")
if response.status_code == 403:
raise DownloadAuthError("Vault token is valid but has insufficient permissions to access this file.")

# 3c. Generic forbidden without Bearer challenge
if response.status_code == 403:
raise DownloadAuthError("Access forbidden: your token or API key does not have permission to download this file.")

# 3d. Generic unauthorized without Bearer
if response.status_code == 401:
raise DownloadAuthError(
"Unauthorized: access denied. Check your --databus-key or --vault-token settings."
)

try:
response.raise_for_status() # Raise if still failing
except requests.exceptions.HTTPError as e:
Expand Down
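
The order of the checks in step 3 matters: the Bearer challenge is handled first, and the generic 403/401 branches apply only afterwards. The pure function below sketches that decision order without any network calls. `classify_auth_response` and its return labels are hypothetical simplifications; the PR itself raises `DownloadAuthError` with full messages, and the post-retry 401/403 handling is left out here.

```python
from typing import Optional

# Copied from the diff above so the example is self-contained.
VAULT_REQUIRED_HOSTS = {"data.dbpedia.io", "data.dev.dbpedia.link"}


def classify_auth_response(
    status_code: int,
    www_authenticate: str,
    host: Optional[str],
    has_vault_token: bool,
) -> str:
    """Return a short label for the branch that would handle the response."""
    # 3a: Bearer challenge, only honored for hosts configured for Vault.
    if status_code == 401 and "bearer" in www_authenticate.lower():
        if host not in VAULT_REQUIRED_HOSTS:
            return "bearer-from-unconfigured-host"
        if not has_vault_token:
            return "vault-token-missing"
        return "exchange-token-and-retry"
    # 3c: forbidden without a Bearer challenge.
    if status_code == 403:
        return "forbidden"
    # 3d: unauthorized without a Bearer challenge.
    if status_code == 401:
        return "unauthorized"
    return "no-auth-problem"


print(classify_auth_response(401, "Bearer realm=x", "data.dbpedia.io", True))
print(classify_auth_response(401, "Bearer realm=x", "example.org", True))
print(classify_auth_response(401, "", "example.org", False))
```

Keeping the decision in a pure function like this would make the branches easy to unit-test without mocking `requests`.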
24 changes: 18 additions & 6 deletions databusclient/api/utils.py
@@ -1,3 +1,9 @@
"""Utility helpers used by the API submodules.

Contains small parsing helpers and HTTP helpers that are shared by
`download`, `deploy` and `delete` modules.
"""

from typing import Optional, Tuple

import requests
@@ -24,23 +30,29 @@ def get_databus_id_parts_from_file_url(
A tuple containing (host, accountId, groupId, artifactId, versionId, fileId).
Each element is a string or None if not present.
"""
"""Split a Databus URI into its six parts.

The returned tuple is (host, accountId, groupId, artifactId, versionId, fileId).
Missing parts are returned as ``None``.
"""

uri = uri.removeprefix("https://").removeprefix("http://")
parts = uri.strip("/").split("/")
parts += [None] * (6 - len(parts)) # pad with None if less than 6 parts
return tuple(parts[:6]) # return only the first 6 parts
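
To illustrate the padding behavior of the four lines above, here is a stand-alone copy with two sample calls. `split_databus_uri` is a hypothetical name used only for this example, the URIs are placeholders, and `str.removeprefix` requires Python 3.9 or newer.

```python
from typing import Optional, Tuple


def split_databus_uri(uri: str) -> Tuple[Optional[str], ...]:
    """Stand-alone copy of the splitting logic above, for illustration only."""
    uri = uri.removeprefix("https://").removeprefix("http://")
    parts = uri.strip("/").split("/")
    parts += [None] * (6 - len(parts))  # pad with None if fewer than 6 parts
    return tuple(parts[:6])  # keep only the first 6 parts


# Full file URI: all six parts are present.
print(split_databus_uri("https://databus.example.org/account/group/artifact/2024.01.01/file.ttl"))
# Group URI: the last three parts are padded with None.
print(split_databus_uri("https://databus.example.org/account/group"))
```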


def fetch_databus_jsonld(uri: str, databus_key: str | None = None) -> str:
-"""
-Retrieve JSON-LD representation of a databus resource.
+"""Fetch the JSON-LD representation of a Databus resource.

-Parameters:
-- uri: The full databus URI
-- databus_key: Optional Databus API key for authentication on protected resources
+Args:
+uri: Full Databus resource URI.
+databus_key: Optional API key for protected resources.

 Returns:
-JSON-LD string representation of the databus resource.
+The response body as a string containing JSON-LD.
"""

headers = {"Accept": "application/ld+json"}
if databus_key is not None:
headers["X-API-KEY"] = databus_key
Expand Down