Skip to content

Conversation

@DhanashreePetare
Copy link
Contributor

@DhanashreePetare DhanashreePetare commented Dec 18, 2025

Pull Request

Description

Implement host-based Vault token handling and clearer auth errors.

  1. Only exchange Vault tokens for hosts listed in VAULT_REQUIRED_HOSTS.
  2. Fail early with a clear message when a Vault-required host is requested but no token provided.
  3. Convert 401/403 auth responses into human-readable CLI errors.
  4. Add unit tests covering missing token, invalid token (401), and insufficient permissions (403).

Related Issues
Issue #19

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • My code follows the ruff code style of this project.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
    • poetry run pytest - all tests passed
    • poetry run ruff check - no linting errors

Summary by CodeRabbit

  • New Features

    • Vault token authentication added for protected download hosts.
  • Improvements

    • Early detection when a vault token is required and clearer, host-specific authentication messages.
    • Better handling and mapping of 401/403 download errors to user-friendly guidance.
  • Documentation

    • README clarified when vault tokens are required vs. not needed for public downloads.
  • Tests

    • Added unit tests for download/auth flows and a lightweight test shim; marked an integration download test as skipped.
  • Chores

    • Added vault-token.dat to .gitignore.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

Walkthrough

Adds Vault-based authentication handling to downloads: new VAULT_REQUIRED_HOSTS and DownloadAuthError, host-aware early validation, token-exchange flow for Bearer challenges, CLI mapping of auth errors to user-facing exceptions, tests for auth paths, README note, and a new gitignore entry.

Changes

Cohort / File(s) Change Summary
Documentation
README.md
Clarifies --vault-token usage: tokens required only for certain protected Databus hosts, client detects such hosts and fails early if a token is required but not provided; advises not to pass --vault-token for public downloads.
Core Authentication Implementation
databusclient/api/download.py
Adds VAULT_REQUIRED_HOSTS, DownloadAuthError, urlparse import; implements host-aware HEAD/GET handling, Vault token-exchange path for Bearer challenges, early validation for protected hosts, adjusted redirect/Accept-Encoding handling, and user-friendly 401/403 error mapping.
CLI Integration
databusclient/cli.py
Imports DownloadAuthError and wraps api_download invocation to catch auth errors and raise a ClickException with the error message for CLI-friendly reporting.
Test Infrastructure
tests/conftest.py
Adds a test-time shim that injects a lightweight fake SPARQLWrapper module with a DummySPARQL implementation when the real package is absent.
Authentication Tests
tests/test_download_auth.py
New unit tests covering Vault-required host behavior, non-vault host handling, token-exchange success/failure, and 401/403 error mappings via mocked requests.head, requests.get, and requests.post.
Test tweaks
tests/test_download.py
Adds a pytest import/blank line and marks test_with_collection with a skip decorator to treat it as an integration test.
Repository config
.gitignore
Adds vault-token.dat to .gitignore.

Sequence Diagram

sequenceDiagram
    participant Client as Client/CLI
    participant Download as databusclient/api/download.py
    participant Vault as Vault Service
    participant DataHost as Protected Data Host

    Client->>Download: api_download(url, vault_token_file?)
    Download->>Download: parse URL -> host
    Download->>DataHost: HEAD request (allow_redirects=False)
    DataHost-->>Download: 401 (WWW-Authenticate: Bearer ...) or redirect

    alt Host listed in VAULT_REQUIRED_HOSTS
        Download->>Download: require vault token file (else raise DownloadAuthError)
    end

    alt HEAD/GET response is 401 with Bearer challenge
        Download->>Download: read vault token file / REFRESH_TOKEN
        Download->>Vault: POST exchange to obtain access token
        Vault-->>Download: access token
        Download->>DataHost: GET request with Authorization: Bearer <access token>
        alt 200
            DataHost-->>Download: payload (success)
            Download-->>Client: return content
        else 401
            Download-->>Client: raise DownloadAuthError "invalid or expired"
        else 403
            Download-->>Client: raise DownloadAuthError "insufficient permissions"
        end
    else 401 on non-vault-configured host
        Download-->>Client: raise DownloadAuthError (host not configured for Vault exchange)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review databusclient/api/download.py for correct host parsing, redirect handling, Accept-Encoding adjustments, retry semantics, and precise error messages.
  • Validate tests/test_download_auth.py covers token-exchange success/failure and edge cases; ensure mocks accurately reflect real requests behavior.
  • Confirm CLI exception mapping in databusclient/cli.py preserves exit semantics and surfaces clear messages.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: restricting Vault token exchange to specific hosts and improving auth error handling, with reference to Issue #19.
Description check ✅ Passed The PR description covers all required template sections with clear objectives, related issues, type of change, and a completed checklist indicating tests and linting passed.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@Integer-Ctrl Integer-Ctrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rework the _download_file function. You can test the behavior locally by generating your own vault token, see the above comment for details on how to create one and which datasets require authentication. Also, ensure that the functionality for unprotected data remains.

Comment on lines 72 to 77
parsed = urlparse(url)
host = parsed.hostname
if host in VAULT_REQUIRED_HOSTS and not vault_token_file:
raise DownloadAuthError(
f"Vault token required for host '{host}', but no token was provided. Please use --vault-token."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client gets an databus uri as an argument which will be downloaded, e.g.:
databusclient download https://databus.dev.dbpedia.link/fhofer/live-fusion-kg-snapshot/geo-coordinates/2025-11-11/geo-coordinates_graph\=dbpedia-io_partition\=wgs-lat.ttl.gz --vault-token vault-token.dat

The databus sends a redirect uri back to the client which represents the actual download location. For the example above this is:

Download file: https://databus.dev.dbpedia.link/fhofer/live-fusion-kg-snapshot/geo-coordinates/2025-11-11/geo-coordinates_graph=dbpedia-io_partition=wgs-lat.ttl.gz
Redirects url:  https://data.dbpedia.io/databus.dbpedia.org/dbpedia-enterprise/live-fusion-kg-snapshot/geo-coordinates/2025-11-11/geo-coordinates_graph=dbpedia-io_partition=wgs-lat.ttl.gz
Error: Server requests Bearer authentication but this host is not configured for Vault token exchange. Try providing a databus API key with --databus-key or contact your administrator.

So you have to check the redirect uri if authentication is required not the databus uri.

You can read here about how to get an access token so you can test with one of the KGs that need a vault token authentication, e.g. live-fusion-kg

Comment on lines 97 to 99
url = response.headers.get("Location")
print("Redirects url: ", url)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parsed = urlparse(url)
host = parsed.hostname

The host should be specified here, since it is almost certain to be overwritten by the redirect.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
.gitignore (1)

3-3: Appropriate security addition to prevent credential leakage.

Adding vault-token.dat to .gitignore prevents accidental commits of sensitive authentication tokens, which aligns with the PR's Vault-based auth implementation.

Consider adding a brief inline comment to clarify the security rationale for future maintainers:

# project-specific
tmp/
+# Vault authentication token (sensitive—should never be committed)
vault-token.dat

This is optional but improves clarity around why this particular file must be ignored.

databusclient/api/download.py (1)

141-141: Clarify: Why remove Accept-Encoding header before auth retry?

The code removes the Accept-Encoding header before retrying with authentication. This header was originally set to "identity" (line 108-109) to disable gzip compression for accurate content-length tracking in the progress bar.

Removing it before the authenticated retry means the server may return gzipped content, which could affect progress bar accuracy. Is this intentional? If so, a comment explaining the rationale would be helpful.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c92e58f and 81b895b.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • databusclient/api/download.py (5 hunks)
  • tests/test_download.py (2 hunks)
🔇 Additional comments (5)
tests/test_download.py (1)

3-4: LGTM: Good practice to skip integration tests

The addition of the skip decorator for the integration test is appropriate. This prevents the test from running in CI/CD environments where live databus.dbpedia.org connectivity may not be available, while keeping the test code for manual validation when needed.

Also applies to: 30-30

databusclient/api/download.py (4)

4-4: LGTM: Clean authentication scaffolding

The new imports, VAULT_REQUIRED_HOSTS configuration, and DownloadAuthError exception provide a solid foundation for host-based Vault authentication. Having a centralized set of vault-required hosts makes the authentication logic predictable and maintainable.

Also applies to: 16-24


90-105: LGTM: Clear authentication error handling for HEAD 401 responses

The logic correctly distinguishes between vault-required hosts and other protected hosts, providing clear error messages when authentication credentials are missing. The fallback to X-API-KEY authentication for non-vault hosts is appropriate.

Note: This logic depends on the host variable being correctly extracted from the final redirect URL (see previous comment on lines 85-88).


147-161: LGTM: User-friendly error messages for authentication failures

The error handling provides clear, actionable messages that distinguish between different authentication failure scenarios:

  • Invalid/expired tokens (401 after retry)
  • Insufficient permissions (403 after retry)
  • Generic access denied with configuration hints

This will significantly improve the user experience when authentication issues occur.


120-132: Use final URL from GET response for host-based authentication check

The host variable is extracted from the HEAD response (line 87-88), but the subsequent GET request (line 111-113) with allow_redirects=True may follow different redirects than the HEAD request. By default, the Python requests library does not follow redirects for HEAD requests, while GET requests do follow them. If the URLs differ after redirect handling, the authentication logic at line 122 may be checking against an incorrect host.

Extract the host from response.url (the GET response's final URL) before the authentication check at line 122 to ensure decisions are based on the actual download location.

Comment on lines +85 to +88
# Extract hostname from final URL (after redirect) to check if vault token needed.
# This is the actual download location that may require authentication.
parsed = urlparse(url)
host = parsed.hostname
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Extract hostname from final URL after all redirects

Line 87 parses the url variable, which contains only the first redirect location (from line 80). However, the HEAD request on line 83 uses the default allow_redirects=True, meaning it may follow multiple redirect hops. The response.url attribute contains the final URL after all redirects, which is the actual download location that should be checked for authentication requirements.

Using the intermediate redirect URL instead of the final URL will cause authentication checks (lines 93, 122) to evaluate against the wrong host, potentially failing authentication for vault-required hosts or incorrectly attempting vault auth for non-vault hosts.

🔎 Proposed fix
-    # Extract hostname from final URL (after redirect) to check if vault token needed.
-    # This is the actual download location that may require authentication.
-    parsed = urlparse(url)
-    host = parsed.hostname
+    # Extract hostname from final URL (after all redirects) to check if vault token needed.
+    # response.url contains the final URL after the HEAD request followed all redirects.
+    parsed = urlparse(response.url)
+    host = parsed.hostname

Based on past review comments indicating host should be extracted from the redirect URL, not the original URL.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Extract hostname from final URL (after redirect) to check if vault token needed.
# This is the actual download location that may require authentication.
parsed = urlparse(url)
host = parsed.hostname
# Extract hostname from final URL (after all redirects) to check if vault token needed.
# response.url contains the final URL after the HEAD request followed all redirects.
parsed = urlparse(response.url)
host = parsed.hostname
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 85 to 88, the code parses the
original `url` (an intermediate redirect) instead of the final download
location; change the code to parse `response.url` (the final URL after the HEAD
request completes) and extract `hostname` from that parsed result, handling
cases where `response` or `response.url` may be None before accessing
`hostname`, so subsequent vault-auth checks use the actual final host.

# for known hosts. __get_vault_access__ handles reading the refresh
# token and exchanging it; errors are translated to DownloadAuthError
# for user-friendly CLI output.
vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Pass final URL to token exchange function

Line 139 passes the url variable to __get_vault_access__, but this variable may contain an intermediate redirect location (set on line 80), not the final download URL. The GET request on lines 111-113 uses allow_redirects=True, so response.url contains the actual final URL after all redirects.

The __get_vault_access__ function extracts the hostname to use as the audience parameter for token exchange (lines 325-333 in the function). If an intermediate URL is passed instead of the final URL, the audience will be incorrect, causing token exchange to fail or produce a token for the wrong host.

🔎 Proposed fix
-        vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
+        vault_token = __get_vault_access__(response.url, vault_token_file, auth_url, client_id)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
vault_token = __get_vault_access__(response.url, vault_token_file, auth_url, client_id)
🤖 Prompt for AI Agents
In databusclient/api/download.py around line 139 (caller of
__get_vault_access__), the code passes the possibly-intermediate variable `url`
to __get_vault_access__, but the HTTP GET on lines 111-113 used
allow_redirects=True so the final resolved URL is in `response.url`; pass
`response.url` (the final download URL) into __get_vault_access__ so the
function extracts the correct hostname/audience for the token exchange, ensuring
you reference the `response` variable in scope (and fall back to `url` only if
`response` is None).

@Integer-Ctrl Integer-Ctrl merged commit 9d3a64c into dbpedia:main Dec 21, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants