Open
Labels
enhancement (New feature or request)
Description
Hello!
I'm curious whether it would be possible to expose a regex token-pattern parameter, like token_pattern in scikit-learn's CountVectorizer. This would help filter (un)wanted characters during tokenization, e.g. hyphens, ampersands, and apostrophes.
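To illustrate the kind of filtering meant here, a minimal sketch using only Python's re module. CountVectorizer's default tokenizer essentially applies its token_pattern with findall, so plain re.findall is used as a stand-in. DEFAULT_PATTERN is the default documented for CountVectorizer; CUSTOM_PATTERN and the tokenize helper are assumptions made up for this example, not part of any library.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer:
# runs of two or more word characters, so punctuation always splits tokens.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical custom pattern (an assumption for illustration): also allow
# hyphens, ampersands, and apostrophes in the interior of a token.
CUSTOM_PATTERN = r"(?u)\b\w[\w&'-]*\w\b"


def tokenize(text, pattern):
    """Return all non-overlapping matches of `pattern` in `text`."""
    return re.findall(pattern, text)


text = "state-of-the-art R&D isn't easy"

default_tokens = tokenize(text, DEFAULT_PATTERN)
# ['state', 'of', 'the', 'art', 'isn', 'easy']

custom_tokens = tokenize(text, CUSTOM_PATTERN)
# ['state-of-the-art', 'R&D', "isn't", 'easy']

print(default_tokens)
print(custom_tokens)
```

With the default pattern, "R&D" disappears entirely (both letters are single-character tokens) and "isn't" is cut to "isn", which is the behaviour a configurable pattern would let users change.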
The workaround I have found so far is to pass a custom POS tagger (the custom_pos_tagger param of KeyphraseVectorizer). I don't change any POS patterns or behaviour; I only recompile the underlying spaCy tokenizer object with modified prefix, suffix, and infix params. Is there a simpler way of exposing such behaviour? Keen to hear your thoughts!
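For comparison, a rough sketch of what a regex-driven variant of that hook could look like. This is not the spaCy prefix/suffix/infix approach described above; it swaps in plain regex tokenization to keep the example dependency-free. It assumes, as I understand the library's documentation, that custom_pos_tagger must accept a list of strings in a parameter named raw_documents and return a list of (token, POS-tag) tuples. make_regex_pos_tagger, tag_tokens, and placeholder_tags are hypothetical names, and the placeholder tagger is not a real POS tagger.

```python
import re
from typing import Callable, List, Tuple


def make_regex_pos_tagger(
    token_pattern: str,
    tag_tokens: Callable[[List[str]], List[str]],
) -> Callable[[List[str]], List[Tuple[str, str]]]:
    """Build a tagger callable whose tokenization is driven by a regex.

    `tag_tokens` is a hypothetical hook: it receives one document's tokens
    and returns one POS tag per token.
    """
    compiled = re.compile(token_pattern)

    def custom_pos_tagger(raw_documents: List[str]) -> List[Tuple[str, str]]:
        tagged: List[Tuple[str, str]] = []
        for document in raw_documents:
            tokens = compiled.findall(document)
            # Pair every token with the tag produced for it.
            tagged.extend(zip(tokens, tag_tokens(tokens)))
        return tagged

    return custom_pos_tagger


def placeholder_tags(tokens: List[str]) -> List[str]:
    # Placeholder only: labels every token "NOUN". A real POS tagger
    # would be plugged in here instead.
    return ["NOUN"] * len(tokens)


tagger = make_regex_pos_tagger(r"(?u)\b\w[\w&'-]*\w\b", placeholder_tags)
tagged = tagger(raw_documents=["R&D on state-of-the-art models"])
print(tagged)
```

The point of the sketch is only that the tokenization rule becomes a single string argument; a built-in parameter would remove the need to write and maintain this wrapper.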
Thanks