2 changes: 1 addition & 1 deletion .github/workflows/run-tests.yml
@@ -32,7 +32,7 @@ jobs:
platform:
- ubuntu-latest
- macos-latest
- windows-latest
# - windows-latest
runs-on: ${{ matrix.platform }}
name: Python ${{ matrix.python }}, ${{ matrix.platform }}
steps:
7 changes: 3 additions & 4 deletions CHANGELOG.md
@@ -1,7 +1,6 @@
# Changelog

## Version 0.1 (development)
## Version 0.0.1

- Feature A added
- FIX: nasty bug #1729 fixed
- add your changes here!
- Initial implementation to access OrgDB objects.
- The registry fetches the AnnotationHub SQLite metadata file and queries it for available org SQLite files, instead of relying on a static registry as the txdb package does.
109 changes: 106 additions & 3 deletions README.md
@@ -1,11 +1,13 @@
[![PyPI-Server](https://img.shields.io/pypi/v/orgdb.svg)](https://pypi.org/project/orgdb/)
![Unit tests](https://github.com/YOUR_ORG_OR_USERNAME/orgdb/actions/workflows/run-tests.yml/badge.svg)
![Unit tests](https://github.com/BiocPy/orgdb/actions/workflows/run-tests.yml/badge.svg)

# orgdb

> Access OrgDB annotations
**OrgDb** provides an interface to access and query **Organism Database (OrgDb)** SQLite files in Python. It mirrors functionality from the R/Bioconductor `AnnotationDbi` package, enabling seamless integration of organism-wide gene annotation into Python workflows.

A longer description of your project goes here...
> [!NOTE]
>
> If you are looking to access TxDb databases, check out the [txdb package](https://www.github.com/biocpy/txdb).

## Install

@@ -15,6 +17,107 @@ To get started, install the package from [PyPI](https://pypi.org/project/orgdb/)
pip install orgdb
```

## Usage

### Using OrgDbRegistry

The registry downloads AnnotationHub's metadata SQLite file and filters it for all available OrgDb databases. You can fetch standard organism databases via the registry (backed by AnnotationHub).

```py
from orgdb import OrgDbRegistry

# Initialize registry and list available organisms
registry = OrgDbRegistry()
available = registry.list_orgdb()
print(available[:5])
# ["org.'Caballeronia_concitans'.eg", "org.'Chlorella_vulgaris'_C-169.eg", ...]

# Load the database for Homo sapiens (downloads and caches automatically)
db = registry.load_db("org.Hs.eg.db")
print(db.species)
# 'Homo sapiens'
```
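
To make the filtering step concrete, here is a minimal sketch of the kind of lookup the registry description implies. It is not the package's implementation: it runs the query from the `_ahub.py` docstring against a toy in-memory database. The table and column names come from that query; the rows, the `https://example.org/` prefix, and the helper `latest_orgdb_files` are made up for illustration.

```py
import sqlite3

# Toy stand-in for the AnnotationHub metadata file. Table and column names
# follow the query in `_ahub.py`; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_prefixes (id INTEGER PRIMARY KEY, location_prefix TEXT);
CREATE TABLE resources (
    id INTEGER PRIMARY KEY, title TEXT, rdatadateadded TEXT, location_prefix_id INTEGER
);
CREATE TABLE rdatapaths (id INTEGER PRIMARY KEY, rdatapath TEXT, resource_id INTEGER);
INSERT INTO location_prefixes VALUES (1, 'https://example.org/');
INSERT INTO resources VALUES
    (1, 'org.Hs.eg.db.sqlite', '2023-04-01', 1),
    (2, 'org.Hs.eg.db.sqlite', '2024-04-01', 1),
    (3, 'some_other_resource.rds', '2024-04-01', 1);
INSERT INTO rdatapaths VALUES
    (1, 'old/org.Hs.eg.db.sqlite', 1),
    (2, 'new/org.Hs.eg.db.sqlite', 2),
    (3, 'misc/some_other_resource.rds', 3);
""")

QUERY = """
SELECT r.title, r.rdatadateadded, lp.location_prefix || rp.rdatapath AS full_rdatapath
FROM resources r
LEFT JOIN location_prefixes lp ON r.location_prefix_id = lp.id
LEFT JOIN rdatapaths rp ON rp.resource_id = r.id
WHERE r.title LIKE 'org%.sqlite';
"""


def latest_orgdb_files(connection):
    """Hypothetical helper: map each title to the URL of its most recently added file."""
    latest = {}
    for title, date_added, url in connection.execute(QUERY):
        # ISO-formatted dates compare correctly as plain strings.
        if title not in latest or date_added > latest[title][0]:
            latest[title] = (date_added, url)
    return {title: url for title, (_, url) in latest.items()}


orgdb_files = latest_orgdb_files(conn)
print(orgdb_files)
# {'org.Hs.eg.db.sqlite': 'https://example.org/new/org.Hs.eg.db.sqlite'}
```

Only the newer of the two `org.Hs.eg.db.sqlite` rows survives, and the non-OrgDb resource is filtered out by the `LIKE` pattern.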

### Inspecting metadata

Explore the available columns and key types in the database.

```py
# List available columns (and keytypes)
cols = db.columns()
print(cols[:5])
# ['ENTREZID', 'PFAM', 'IPI', 'PROSITE', 'ACCNUM']

# Check available keys for a specific keytype
entrez_ids = db.keys("ENTREZID")
print(entrez_ids[:5])
# ['1', '2', '9', '10', '11']
```

### Querying Annotations (using `select`)

The `select` method retrieves data as a `BiocFrame`. It automatically handles complex joins across tables.

```py
# Retrieve Gene Symbols and Gene Names for a list of Entrez IDs
res = db.select(
    keys=["1", "10"],
    columns=["SYMBOL", "GENENAME"],
    keytype="ENTREZID",
)

print(res)
# BiocFrame with 2 rows and 3 columns
#                    GENENAME ENTREZID SYMBOL
#                      <list>   <list> <list>
# [0] alpha-1-B glycoprotein        1   A1BG
# [1]  N-acetyltransferase 2       10   NAT2
```
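
For intuition about the joins mentioned above, here is a rough sketch of what a lookup like this could translate to in SQL. This is an assumption, not the package's actual query: the `genes` and `gene_info` tables and their columns are modeled on the layout commonly found in Bioconductor `org.*.eg.db` files, and `select_symbols` is a hypothetical helper.

```py
import sqlite3

# Toy database; the schema is an assumed, simplified OrgDb-like layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (_id INTEGER PRIMARY KEY, gene_id TEXT);
CREATE TABLE gene_info (_id INTEGER, gene_name TEXT, symbol TEXT);
INSERT INTO genes VALUES (1, '1'), (2, '10');
INSERT INTO gene_info VALUES
    (1, 'alpha-1-B glycoprotein', 'A1BG'),
    (2, 'N-acetyltransferase 2', 'NAT2');
""")


def select_symbols(connection, entrez_ids):
    """Hypothetical helper: join the central `genes` table to `gene_info`."""
    placeholders = ", ".join("?" for _ in entrez_ids)
    sql = f"""
        SELECT g.gene_id, gi.symbol, gi.gene_name
        FROM genes g
        LEFT JOIN gene_info gi ON gi._id = g._id
        WHERE g.gene_id IN ({placeholders})
        ORDER BY g._id
    """
    # Keys are bound as parameters rather than interpolated into the SQL.
    return connection.execute(sql, list(entrez_ids)).fetchall()


rows = select_symbols(conn, ["1", "10"])
print(rows)
# [('1', 'A1BG', 'alpha-1-B glycoprotein'), ('10', 'NAT2', 'N-acetyltransferase 2')]
```

Each requested column lives in a table keyed by an internal `_id`, so the key type (here Entrez IDs in `genes`) acts as the anchor for every join.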

> [!NOTE]
>
> If you request "GO" columns, the result will automatically expand to include "EVIDENCE" and "ONTOLOGY" columns, matching Bioconductor behavior.

```py
go_res = db.select(
    keys="1",
    columns=["GO"],
    keytype="ENTREZID",
)

print(go_res)
# BiocFrame with 12 rows and 4 columns
#      ONTOLOGY ENTREZID         GO EVIDENCE
#        <list>   <list>     <list>   <list>
#  [0]       BP        1 GO:0002764      IBA
#  [1]       CC        1 GO:0005576      HDA
#  [2]       CC        1 GO:0005576      IDA
#           ...      ...        ...      ...
#  [9]       CC        1 GO:0070062      HDA
# [10]       CC        1 GO:0072562      HDA
# [11]       CC        1 GO:1904813      TAS
```
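
One way to picture why "GO" expands into three columns: if each ontology is stored in its own table together with an evidence code, a combined GO result naturally carries both along. The sketch below illustrates that idea on toy data. The `go_bp` and `go_cc` tables and their columns are assumptions modeled on typical `org.*.eg.db` files, and this is not necessarily how `select` is implemented.

```py
import sqlite3

# Toy database; schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (_id INTEGER PRIMARY KEY, gene_id TEXT);
CREATE TABLE go_bp (_id INTEGER, go_id TEXT, evidence TEXT);
CREATE TABLE go_cc (_id INTEGER, go_id TEXT, evidence TEXT);
INSERT INTO genes VALUES (1, '1');
INSERT INTO go_bp VALUES (1, 'GO:0002764', 'IBA');
INSERT INTO go_cc VALUES (1, 'GO:0005576', 'HDA'), (1, 'GO:0005576', 'IDA');
""")

# Each per-ontology table contributes rows tagged with its ontology label.
GO_SQL = """
SELECT 'BP' AS ontology, g.gene_id, t.go_id, t.evidence
FROM genes g JOIN go_bp t ON t._id = g._id WHERE g.gene_id = ?
UNION ALL
SELECT 'CC' AS ontology, g.gene_id, t.go_id, t.evidence
FROM genes g JOIN go_cc t ON t._id = g._id WHERE g.gene_id = ?
ORDER BY ontology, go_id, evidence
"""

go_rows = conn.execute(GO_SQL, ("1", "1")).fetchall()
for row in go_rows:
    print(row)
# ('BP', '1', 'GO:0002764', 'IBA')
# ('CC', '1', 'GO:0005576', 'HDA')
# ('CC', '1', 'GO:0005576', 'IDA')
```

The same GO term can appear more than once for a gene when it is supported by different evidence codes, which is why the result has more rows than distinct GO identifiers.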

### Accessing Genomic Ranges

Extract gene coordinates as a `GenomicRanges` object (requires the `chromosome_locations` table in the OrgDb database).

```py
gr = db.genes()
print(gr)
# GenomicRanges with 52232 ranges and 1 metadata column
# seqnames ranges strand gene_id
# <str> <IRanges> <ndarray[int8]> <list>
# 1 19 -58345182 - -58336872 * | 1
# 2 12 -9067707 - -9019495 * | 2
# 2 12 -9067707 - -9019185 * | 2
# ... ... ... | ...
# 116804918 11 121024101 - 121191490 * | 116804918
# 117779438 1 20154213 - 20160568 * | 117779438
# 118142757 6 42155405 - 42180056 * | 118142757
# ------
# seqinfo(369 sequences): 1 10 10_GL383545v1_alt ... X_KI270913v1_alt Y Y_KZ208924v1_fix
```

<!-- biocsetup-notes -->

## Note
16 changes: 6 additions & 10 deletions docs/index.md
@@ -1,18 +1,14 @@
# orgdb

Access OrgDB annotations
**OrgDb** provides an interface to access and query **Organism Database (OrgDb)** SQLite files in Python. It mirrors functionality from the R/Bioconductor `AnnotationDbi` package, enabling seamless integration of organism-wide gene annotation into Python workflows.

## Install

## Note

> This is the main page of your project's [Sphinx] documentation. It is
> formatted in [Markdown]. Add additional pages by creating md-files in
> `docs` or rst-files (formatted in [reStructuredText]) and adding links to
> them in the `Contents` section below.
>
> Please check [Sphinx] and [MyST] for more information
> about how to document your project and how to configure your preferences.
To get started, install the package from [PyPI](https://pypi.org/project/orgdb/)

```bash
pip install orgdb
```

## Contents

9 changes: 6 additions & 3 deletions setup.cfg
@@ -12,11 +12,11 @@ license = MIT
license_files = LICENSE.txt
long_description = file: README.md
long_description_content_type = text/markdown; charset=UTF-8; variant=GFM
url = https://github.com/pyscaffold/pyscaffold/
url = https://github.com/BiocPy/orgdb
# Add here related links, for example:
project_urls =
Documentation = https://pyscaffold.org/
# Source = https://github.com/pyscaffold/pyscaffold/
Documentation = https://github.com/BiocPy/orgdb
Source = https://github.com/BiocPy/orgdb
# Changelog = https://pyscaffold.org/en/latest/changelog.html
# Tracker = https://github.com/pyscaffold/pyscaffold/issues
# Conda-Forge = https://anaconda.org/conda-forge/pyscaffold
@@ -49,6 +49,9 @@ package_dir =
# For more information, check out https://semver.org/.
install_requires =
importlib-metadata; python_version<"3.8"
genomicranges
biocframe
pybiocfilecache


[options.packages.find]
6 changes: 6 additions & 0 deletions src/orgdb/__init__.py
@@ -14,3 +14,9 @@
__version__ = "unknown"
finally:
del version, PackageNotFoundError

from .orgdb import OrgDb
from .orgdbregistry import OrgDbRegistry
from .record import OrgDbRecord

__all__ = ["OrgDb", "OrgDbRegistry", "OrgDbRecord"]
31 changes: 31 additions & 0 deletions src/orgdb/_ahub.py
@@ -0,0 +1,31 @@
"""This list of OrgDB resources was generated from AnnotationHub.

Code to generate:

```bash
wget https://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3
sqlite3 annotationhub.sqlite3
```

```sql
SELECT
r.title,
r.rdatadateadded,
lp.location_prefix || rp.rdatapath AS full_rdatapath
FROM resources r
LEFT JOIN location_prefixes lp
ON r.location_prefix_id = lp.id
LEFT JOIN rdatapaths rp
ON rp.resource_id = r.id
WHERE r.title LIKE 'org%.sqlite';
```

Note: we only keep the latest version of these files.

"""

__author__ = "Jayaram Kancherla"
__copyright__ = "Jayaram Kancherla"
__license__ = "MIT"

AHUB_METADATA_URL = "https://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3"