novelties-bookshare

Installation

uv

Clone the repository and use uv:

uv sync

By default, this installs everything, including the CUDA version of PyTorch. Several extras are available to configure your installation. For development dependencies:

uv sync --dev

For GPU acceleration, use the extra matching your hardware:

# if you have a cuda 12.8 compatible GPU
uv sync --extra cu128
# if you have a rocm 6.3 compatible GPU
uv sync --extra rocm63

A cpu extra also exists for the CPU-only version of PyTorch, following the same pattern:
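
uv sync --extra cpu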

To install not only the library but also the additional dependencies needed to reproduce our experiments, use the experiments extra:

uv sync --extra experiments

guix

We provide a more reproducible environment using guix:

guix time-machine -C channels.scm -- shell -C -m manifest.scm

Library user guide

Encrypting your corpus

from novelties_bookshare.encrypt import encrypt_tokens

# assuming my_tokens is a list of tokens, and my_annotations is a list
# of single or multiple annotations (one or more annotations per
# token)
my_tokens, my_annotations = load_my_corpus()

# encrypt tokens with the desired hash length (2 is a solid default
# value)
encrypted_tokens = encrypt_tokens(my_tokens, hash_len=2)

with open("encrypted_corpus", "w") as f:
    for token, annotations in encrypt_tokens, my_annotations:
        f.write(f"{token} {annotations}\n")

Decrypting a shared encrypted corpus

Decrypting tokens is done using the novelties_bookshare.decrypt.decrypt_tokens function:

from novelties_bookshare.decrypt import decrypt_tokens

# let's suppose you wish to decrypt an encrypted corpus
encrypted_source_tokens = load_source_tokens()
source_annotations = load_source_annotations()
# each token has one or more annotations
assert len(encrypted_source_tokens) == len(source_annotations)

# let's suppose you have access to a degraded version of the tokens of
# the source corpus.
my_tokens = load_my_tokens()

# you can recover the source tokens using decrypt_tokens!
decrypted = decrypt_tokens(encrypted_source_tokens, my_tokens, hash_len=2)

The more your tokens differ from the source tokens, the more errors will occur during decryption. You can use additional decryption plugins to improve performance. Here are some examples:

from novelties_bookshare.decrypt import (
    make_plugin_propagate,
    make_plugin_mlm,
    make_plugin_split,
    make_plugin_case,
)

# Option #1: lightweight but effective, using the propagate plugin alone
decrypted = decrypt_tokens(
    encrypted_source_tokens,
    my_tokens,
    hash_len=2,
    decryption_plugins=[make_plugin_propagate()],
)

# Option #2: heavier but more powerful, using a sequence of plugins
decrypted = decrypt_tokens(
    encrypted_source_tokens,
    my_tokens,
    hash_len=2,
    decryption_plugins=[
        make_plugin_propagate(),
        make_plugin_case(),
        make_plugin_split(max_token_len=16, max_splits_nb=8),
    ],
)

# Option #3: heaviest but the most powerful, ending the sequence of
# plugins with masked language modeling
decrypted = decrypt_tokens(
    encrypted_source_tokens,
    my_tokens,
    hash_len=2,
    decryption_plugins=[
        make_plugin_propagate(),
        make_plugin_case(),
        make_plugin_split(max_token_len=16, max_splits_nb=8),
        # if you have a GPU, you can pass device="cuda" for GPU
        # accelerated inference.
        make_plugin_mlm("answerdotai/ModernBERT-base", window=32)
    ],
)
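
To compare plugin configurations on your own data, a simple check is to measure the token-level error rate against a reference version of the corpus you already have. The error_rate function below is a hypothetical helper, not part of the library:

# hypothetical helper: fraction of decrypted tokens that differ from a
# known reference version of the corpus
def error_rate(decrypted: list[str], reference: list[str]) -> float:
    assert len(decrypted) == len(reference)
    return sum(d != r for d, r in zip(decrypted, reference)) / len(reference)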

Adding decryption plugins, however, also increases runtime. To reduce it, you can take advantage of the fact that a dataset might be chunked, as in the case of a book divided into chapters. Since novelties-bookshare uses difflib to align sequences, which is O(n^2), it is usually noticeably faster to align chapters separately than to align the whole document at once. The drawback is that you need aligned chapters. novelties-bookshare supports this use case out of the box: the decrypt_tokens function accepts either a list of tokens or a list of chunks:

from typing import Any
from novelties_bookshare.decrypt import decrypt_tokens

encrypted_source_chapters: list[list[str]] = load_source_chapters()
source_annotations: list[list[Any]] = load_source_annotations()

my_chapters: list[list[str]] = load_my_chapters()

# decrypt_tokens supports list of chunks out of the box!
decrypted = decrypt_tokens(encrypted_source_chapters, my_chapters, hash_len=2)
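
As a usage sketch, and assuming the decrypted output mirrors the chunked input structure (one token list per chapter), you could then write each decrypted chapter back out alongside its annotations:

# pair each decrypted chapter with its annotations; assumes the output
# is chunked like the input
for chapter_idx, (tokens, annotations) in enumerate(
    zip(decrypted, source_annotations)
):
    with open(f"chapter_{chapter_idx}.txt", "w") as f:
        for token, annotation in zip(tokens, annotations):
            f.write(f"{token} {annotation}\n")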
