Expose regex token_pattern

Hello! 

Would it be possible to expose a regex `token_pattern` parameter like the one in [CountVectorizer](https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef/sklearn/feature_extraction/text.py#L993)? This would help filter (un)wanted characters during tokenization, e.g. hyphens, ampersands, and apostrophes.
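To illustrate the kind of control being asked for, here is a minimal sketch using only Python's `re` module. `DEFAULT_PATTERN` is the default `token_pattern` of scikit-learn's CountVectorizer; `CUSTOM_PATTERN` and the `tokenize` helper are hypothetical names made up for this example and are not part of any library.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative (an assumption for illustration): keep
# hyphens and apostrophes that sit between word characters.
CUSTOM_PATTERN = r"(?u)\b\w+(?:[-']\w+)*\b"


def tokenize(text, token_pattern):
    """Return all non-overlapping matches of token_pattern in text."""
    return re.compile(token_pattern).findall(text)


text = "State-of-the-art R&D isn't easy"

default_tokens = tokenize(text, DEFAULT_PATTERN)
custom_tokens = tokenize(text, CUSTOM_PATTERN)

print(default_tokens)  # ['State', 'of', 'the', 'art', 'isn', 'easy']
print(custom_tokens)   # ['State-of-the-art', 'R', 'D', "isn't", 'easy']
```

With the default pattern, hyphenated terms and contractions are split and single-character pieces are dropped; the custom pattern keeps `State-of-the-art` and `isn't` intact while still splitting on the ampersand.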

The workaround I have found so far is to pass a custom POS tagger (the `custom_pos_tagger` param of KeyphraseVectorizer) that leaves the POS patterns and tagging behaviour unchanged, but modifies and recompiles the prefix, suffix, and infix rules of the underlying spaCy tokenizer object. Is there a simpler way to expose this behaviour? Keen to hear your thoughts!

Thanks

Expose regex token_pattern #20
