novelties-bookshare

Installation

uv

Clone the repository and use uv:

uv sync

By default, this installs everything, including the CUDA version of PyTorch. Several extras are available to configure your installation. For development dependencies:

uv sync --dev

For GPU acceleration, use the extra matching your hardware:

# if you have a cuda 12.8 compatible GPU
uv sync --extra cu128
# if you have a rocm 6.3 compatible GPU
uv sync --extra rocm63

A cpu extra also exists for the CPU-only version of PyTorch, following the same pattern:
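
uv sync --extra cpu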

To install not only the library but also the additional dependencies needed to reproduce our experiments, use the experiments extra:

uv sync --extra experiments

guix

We provide a more reproducible environment using guix:

guix time-machine -C channels.scm -- shell -C -m manifest.scm

Library user guide

Encrypting your corpus

from novelties_bookshare.encrypt import encrypt_tokens

# assuming my_tokens is a list of tokens, and my_annotations is a list
# of single or multiple annotations (one or more annotations per
# token)
my_tokens, my_annotations = load_my_corpus()

# encrypt tokens with the desired hash length (2 is a solid default
# value)
encrypted_tokens = encrypt_tokens(my_tokens, hash_len=2)

with open("encrypted_corpus", "w") as f:
    for token, annotations in encrypt_tokens, my_annotations:
        f.write(f"{token} {annotations}\n")

Decrypting a shared encrypted corpus

Decrypting tokens is done using the novelties_bookshare.decrypt.decrypt_tokens function:

from novelties_bookshare.decrypt import decrypt_tokens

# let's suppose you wish to decrypt an encrypted corpus
encrypted_source_tokens = load_source_tokens()
source_annotations = load_source_annotations()
# each token has one or more annotations
assert len(encrypted_source_tokens) == len(source_annotations)

# let's suppose you have access to a degraded version of the tokens of
# the source corpus.
my_tokens = load_my_tokens()

# you can recover the source tokens using decrypt_tokens!
decrypted = decrypt_tokens(encrypted_source_tokens, my_tokens, hash_len=2)

The more your tokens differ from the source tokens, the more errors will occur during decryption. You can use additional decryption plugins to improve performance. Here are some examples:

from novelties_bookshare.decrypt import (
    make_plugin_propagate,
    make_plugin_mlm,
    make_plugin_split,
    make_plugin_case,
)

# Option #1: lightweight but effective, using the propagate plugin alone
decrypted = decrypt_tokens(
    encrypted_source_tokens,
    my_tokens,
    hash_len=2,
    decryption_plugins=[make_plugin_propagate()],
)

# Option #2: heavier but more powerful, using a sequence of plugins
decrypted = decrypt_tokens(
    encrypted_source_tokens,
    my_tokens,
    hash_len=2,
    decryption_plugins=[
        make_plugin_propagate(),
        make_plugin_case(),
        make_plugin_split(max_token_len=16, max_splits_nb=8),
    ],
)

# Option #3: heaviest but the most powerful, ending the sequence of
# plugins with masked language modeling
decrypted = decrypt_tokens(
    encrypted_source_tokens,
    my_tokens,
    hash_len=2,
    decryption_plugins=[
        make_plugin_propagate(),
        make_plugin_case(),
        make_plugin_split(max_token_len=16, max_splits_nb=8),
        # if you have a GPU, you can pass device="cuda" for GPU
        # accelerated inference.
        make_plugin_mlm("answerdotai/ModernBERT-base", window=32)
    ],
)
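
To compare plugin configurations on your own data, a simple check is to measure the token-level error rate against a reference version of the corpus you already have. The error_rate function below is a hypothetical helper, not part of the library:

# hypothetical helper: fraction of decrypted tokens that differ from a
# known reference version of the corpus
def error_rate(decrypted: list[str], reference: list[str]) -> float:
    assert len(decrypted) == len(reference)
    return sum(d != r for d, r in zip(decrypted, reference)) / len(reference)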

Adding decryption plugins, however, also increases runtime. To reduce it, you can take advantage of the fact that a dataset might be chunked, as in the case of a book divided into chapters. Since novelties-bookshare uses difflib to align sequences, which is O(n^2), it is usually noticeably faster to align chapters separately than to align the whole document at once. The drawback is that you need aligned chapters. novelties-bookshare supports this use case out of the box: the decrypt_tokens function accepts either a list of tokens or a list of chunks:

from typing import Any
from novelties_bookshare.decrypt import decrypt_tokens

encrypted_source_chapters: list[list[str]] = load_source_chapters()
source_annotations: list[list[Any]] = load_source_annotations()

my_chapters: list[list[str]] = load_my_chapters()

# decrypt_tokens supports list of chunks out of the box!
decrypted = decrypt_tokens(encrypted_source_chapters, my_chapters, hash_len=2)
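
As a usage sketch, and assuming the decrypted output mirrors the chunked input structure (one token list per chapter), you could then write each decrypted chapter back out alongside its annotations:

# pair each decrypted chapter with its annotations; assumes the output
# is chunked like the input
for chapter_idx, (tokens, annotations) in enumerate(
    zip(decrypted, source_annotations)
):
    with open(f"chapter_{chapter_idx}.txt", "w") as f:
        for token, annotation in zip(tokens, annotations):
            f.write(f"{token} {annotation}\n")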
