-
Notifications
You must be signed in to change notification settings - Fork 9
Restrict Vault token exchange to specific hosts; improve auth errors; (Issue #19) #40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restrict Vault token exchange to specific hosts; improve auth errors; (Issue #19) #40
Conversation
… add tests (fixes dbpedia#19)
… add tests and docs note (fixes dbpedia#19)
📝 WalkthroughWalkthroughAdds Vault-based authentication handling to downloads: new Changes
Sequence DiagramsequenceDiagram
participant Client as Client/CLI
participant Download as databusclient/api/download.py
participant Vault as Vault Service
participant DataHost as Protected Data Host
Client->>Download: api_download(url, vault_token_file?)
Download->>Download: parse URL -> host
Download->>DataHost: HEAD request (allow_redirects=False)
DataHost-->>Download: 401 (WWW-Authenticate: Bearer ...) or redirect
alt Host listed in VAULT_REQUIRED_HOSTS
Download->>Download: require vault token file (else raise DownloadAuthError)
end
alt HEAD/GET response is 401 with Bearer challenge
Download->>Download: read vault token file / REFRESH_TOKEN
Download->>Vault: POST exchange to obtain access token
Vault-->>Download: access token
Download->>DataHost: GET request with Authorization: Bearer <access token>
alt 200
DataHost-->>Download: payload (success)
Download-->>Client: return content
else 401
Download-->>Client: raise DownloadAuthError "invalid or expired"
else 403
Download-->>Client: raise DownloadAuthError "insufficient permissions"
end
else 401 on non-vault-configured host
Download-->>Client: raise DownloadAuthError (host not configured for Vault exchange)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Integer-Ctrl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rework the _download_file function. You can test the behavior locally by generating your own vault token, see the above comment for details on how to create one and which datasets require authentication. Also, ensure that the functionality for unprotected data remains.
databusclient/api/download.py
Outdated
| parsed = urlparse(url) | ||
| host = parsed.hostname | ||
| if host in VAULT_REQUIRED_HOSTS and not vault_token_file: | ||
| raise DownloadAuthError( | ||
| f"Vault token required for host '{host}', but no token was provided. Please use --vault-token." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client gets an databus uri as an argument which will be downloaded, e.g.:
databusclient download https://databus.dev.dbpedia.link/fhofer/live-fusion-kg-snapshot/geo-coordinates/2025-11-11/geo-coordinates_graph\=dbpedia-io_partition\=wgs-lat.ttl.gz --vault-token vault-token.dat
The databus sends a redirect uri back to the client which represents the actual download location. For the example above this is:
Download file: https://databus.dev.dbpedia.link/fhofer/live-fusion-kg-snapshot/geo-coordinates/2025-11-11/geo-coordinates_graph=dbpedia-io_partition=wgs-lat.ttl.gz
Redirects url: https://data.dbpedia.io/databus.dbpedia.org/dbpedia-enterprise/live-fusion-kg-snapshot/geo-coordinates/2025-11-11/geo-coordinates_graph=dbpedia-io_partition=wgs-lat.ttl.gz
Error: Server requests Bearer authentication but this host is not configured for Vault token exchange. Try providing a databus API key with --databus-key or contact your administrator.
So you have to check the redirect uri if authentication is required not the databus uri.
You can read here about how to get an access token so you can test with one of the KGs that need a vault token authentication, e.g. live-fusion-kg
| url = response.headers.get("Location") | ||
| print("Redirects url: ", url) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parsed = urlparse(url)
host = parsed.hostname
The host should be specified here, since it is almost certain to be overwritten by the redirect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
.gitignore (1)
3-3: Appropriate security addition to prevent credential leakage.Adding
vault-token.datto .gitignore prevents accidental commits of sensitive authentication tokens, which aligns with the PR's Vault-based auth implementation.Consider adding a brief inline comment to clarify the security rationale for future maintainers:
# project-specific tmp/ +# Vault authentication token (sensitive—should never be committed) vault-token.datThis is optional but improves clarity around why this particular file must be ignored.
databusclient/api/download.py (1)
141-141: Clarify: Why remove Accept-Encoding header before auth retry?The code removes the
Accept-Encodingheader before retrying with authentication. This header was originally set to"identity"(line 108-109) to disable gzip compression for accurate content-length tracking in the progress bar.Removing it before the authenticated retry means the server may return gzipped content, which could affect progress bar accuracy. Is this intentional? If so, a comment explaining the rationale would be helpful.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.gitignore(1 hunks)databusclient/api/download.py(5 hunks)tests/test_download.py(2 hunks)
🔇 Additional comments (5)
tests/test_download.py (1)
3-4: LGTM: Good practice to skip integration testsThe addition of the skip decorator for the integration test is appropriate. This prevents the test from running in CI/CD environments where live databus.dbpedia.org connectivity may not be available, while keeping the test code for manual validation when needed.
Also applies to: 30-30
databusclient/api/download.py (4)
4-4: LGTM: Clean authentication scaffoldingThe new imports, VAULT_REQUIRED_HOSTS configuration, and DownloadAuthError exception provide a solid foundation for host-based Vault authentication. Having a centralized set of vault-required hosts makes the authentication logic predictable and maintainable.
Also applies to: 16-24
90-105: LGTM: Clear authentication error handling for HEAD 401 responsesThe logic correctly distinguishes between vault-required hosts and other protected hosts, providing clear error messages when authentication credentials are missing. The fallback to X-API-KEY authentication for non-vault hosts is appropriate.
Note: This logic depends on the
hostvariable being correctly extracted from the final redirect URL (see previous comment on lines 85-88).
147-161: LGTM: User-friendly error messages for authentication failuresThe error handling provides clear, actionable messages that distinguish between different authentication failure scenarios:
- Invalid/expired tokens (401 after retry)
- Insufficient permissions (403 after retry)
- Generic access denied with configuration hints
This will significantly improve the user experience when authentication issues occur.
120-132: Use final URL from GET response for host-based authentication checkThe host variable is extracted from the HEAD response (line 87-88), but the subsequent GET request (line 111-113) with
allow_redirects=Truemay follow different redirects than the HEAD request. By default, the Python requests library does not follow redirects for HEAD requests, while GET requests do follow them. If the URLs differ after redirect handling, the authentication logic at line 122 may be checking against an incorrect host.Extract the host from
response.url(the GET response's final URL) before the authentication check at line 122 to ensure decisions are based on the actual download location.
| # Extract hostname from final URL (after redirect) to check if vault token needed. | ||
| # This is the actual download location that may require authentication. | ||
| parsed = urlparse(url) | ||
| host = parsed.hostname |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Extract hostname from final URL after all redirects
Line 87 parses the url variable, which contains only the first redirect location (from line 80). However, the HEAD request on line 83 uses the default allow_redirects=True, meaning it may follow multiple redirect hops. The response.url attribute contains the final URL after all redirects, which is the actual download location that should be checked for authentication requirements.
Using the intermediate redirect URL instead of the final URL will cause authentication checks (lines 93, 122) to evaluate against the wrong host, potentially failing authentication for vault-required hosts or incorrectly attempting vault auth for non-vault hosts.
🔎 Proposed fix
- # Extract hostname from final URL (after redirect) to check if vault token needed.
- # This is the actual download location that may require authentication.
- parsed = urlparse(url)
- host = parsed.hostname
+ # Extract hostname from final URL (after all redirects) to check if vault token needed.
+ # response.url contains the final URL after the HEAD request followed all redirects.
+ parsed = urlparse(response.url)
+ host = parsed.hostnameBased on past review comments indicating host should be extracted from the redirect URL, not the original URL.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Extract hostname from final URL (after redirect) to check if vault token needed. | |
| # This is the actual download location that may require authentication. | |
| parsed = urlparse(url) | |
| host = parsed.hostname | |
| # Extract hostname from final URL (after all redirects) to check if vault token needed. | |
| # response.url contains the final URL after the HEAD request followed all redirects. | |
| parsed = urlparse(response.url) | |
| host = parsed.hostname |
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 85 to 88, the code parses the
original `url` (an intermediate redirect) instead of the final download
location; change the code to parse `response.url` (the final URL after the HEAD
request completes) and extract `hostname` from that parsed result, handling
cases where `response` or `response.url` may be None before accessing
`hostname`, so subsequent vault-auth checks use the actual final host.
| # for known hosts. __get_vault_access__ handles reading the refresh | ||
| # token and exchanging it; errors are translated to DownloadAuthError | ||
| # for user-friendly CLI output. | ||
| vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Pass final URL to token exchange function
Line 139 passes the url variable to __get_vault_access__, but this variable may contain an intermediate redirect location (set on line 80), not the final download URL. The GET request on lines 111-113 uses allow_redirects=True, so response.url contains the actual final URL after all redirects.
The __get_vault_access__ function extracts the hostname to use as the audience parameter for token exchange (lines 325-333 in the function). If an intermediate URL is passed instead of the final URL, the audience will be incorrect, causing token exchange to fail or produce a token for the wrong host.
🔎 Proposed fix
- vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
+ vault_token = __get_vault_access__(response.url, vault_token_file, auth_url, client_id)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) | |
| vault_token = __get_vault_access__(response.url, vault_token_file, auth_url, client_id) |
🤖 Prompt for AI Agents
In databusclient/api/download.py around line 139 (caller of
__get_vault_access__), the code passes the possibly-intermediate variable `url`
to __get_vault_access__, but the HTTP GET on lines 111-113 used
allow_redirects=True so the final resolved URL is in `response.url`; pass
`response.url` (the final download URL) into __get_vault_access__ so the
function extracts the correct hostname/audience for the token exchange, ensuring
you reference the `response` variable in scope (and fall back to `url` only if
`response` is None).
Pull Request
Description
Implement host-based Vault token handling and clearer auth errors.
Related Issues
Issue #19
Type of change
Checklist
poetry run pytest- all tests passedpoetry run ruff check- no linting errorsSummary by CodeRabbit
New Features
Improvements
Documentation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.