Expose regex token_pattern #20

@raj-shah

Hello!

Would it be possible to expose a regex token pattern parameter, like token_pattern in scikit-learn's CountVectorizer? This would help with keeping or filtering out specific characters during tokenization, e.g. hyphens, ampersands, and apostrophes.
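
To illustrate the kind of control I mean, here is a minimal sketch using only Python's re module. DEFAULT_PATTERN is the default token_pattern of scikit-learn's CountVectorizer; CUSTOM_PATTERN, the tokenize helper, and the sample text are hypothetical and only for illustration. The sketch mimics how CountVectorizer lowercases the text and then applies the pattern with findall; it does not touch KeyphraseVectorizer itself.

```python
import re

# Default token_pattern in scikit-learn's CountVectorizer:
# runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical pattern (an assumption, not part of any library) that keeps
# hyphens, ampersands, and apostrophes when they sit between word characters.
# The inner group is non-capturing so findall returns whole matches.
CUSTOM_PATTERN = r"(?u)\b\w+(?:[-&']\w+)*\b"


def tokenize(text, pattern):
    """Lowercase the text, then return every match of the pattern."""
    return re.findall(pattern, text.lower())


text = "State-of-the-art R&D isn't easy"

default_tokens = tokenize(text, DEFAULT_PATTERN)
custom_tokens = tokenize(text, CUSTOM_PATTERN)

print(default_tokens)  # ['state', 'of', 'the', 'art', 'isn', 'easy']
print(custom_tokens)   # ['state-of-the-art', 'r&d', "isn't", 'easy']
```

With CountVectorizer, the same effect comes from passing token_pattern=CUSTOM_PATTERN to the constructor. Note that this hypothetical pattern also admits single-character tokens, which the default pattern drops.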

The workaround I have found so far is to pass a custom POS tagger (the custom_pos_tagger param of KeyphraseVectorizer). It leaves the POS patterns and tagging behaviour unchanged, but recompiles the underlying spaCy tokenizer's prefix, suffix, and infix rules. Is there a simpler way to expose this behaviour? Keen to hear your thoughts!

Thanks

Labels: enhancement (New feature or request)