Python package for natural language pre-processing with nltk and Hunspell.
Includes:
- Standardizing cases
- Standardizing symbols
- Removing extra whitespace
- Removing stopwords
- Simple spelling correction
- Lemmatization
Available utilities:
- `clean_cases`
- `split_camel_cased`
- `clean_invalid_symbols`
- `clean_repeated_symbols`
- `clean_spaces`
- `remove_stopwords`
- `fix_spelling`
- `SpellChecker`
- `lemmatize`
- `clean`
- `soft_clean`
- `full_clean`
Supported languages:
- Spanish
- English
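As a rough sketch of how the individual utilities compose, the snippet below chains a few of them. The `textpreprocess.cleaners.en` module path and the exact signatures are assumptions inferred from the utility names above, not taken from the package:

```python
# Hypothetical chaining of individual utilities; the module path
# `textpreprocess.cleaners.en` and the signatures are assumptions.
from textpreprocess.cleaners.en import clean_cases, clean_spaces, remove_stopwords

text = '  THIS is   a Sample  text  '
text = clean_cases(text)       # assumed to normalize casing
text = clean_spaces(text)      # assumed to collapse extra whitespace
text = remove_stopwords(text)  # assumed to drop English stopwords
```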
Spell-checking functions rely on Hunspell dictionary files, placed by default in the `dictionaries` directory. This collection of dictionaries was added as a git submodule for convenience.
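For reference on how Hunspell consumes these dictionary files, here is a minimal sketch using the pyhunspell bindings; the binding choice and the dictionary paths are assumptions, not the package's internals:

```python
# Sketch of Hunspell-backed spell checking with pyhunspell; the paths
# below are assumptions and may differ from the package defaults.
import hunspell

checker = hunspell.HunSpell('dictionaries/en_US.dic', 'dictionaries/en_US.aff')

if not checker.spell('bery'):       # True only for correctly spelled words
    print(checker.suggest('bery'))  # e.g. ['very', 'berry', ...]
```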
Lemmatization in Spanish relies on lemma dictionary files, placed by default in the `lemmas` directory. This collection was also added as a git submodule for convenience. Feel free to propose your own!
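As an illustration of how a lemma dictionary drives this step, the sketch below maps word forms to lemmas via a plain lookup; the tab-separated `form<TAB>lemma` file format and the file name are assumptions, not the package's actual layout:

```python
# Minimal dictionary-based lemmatizer; assumes each line of the lemma
# file holds "form<TAB>lemma", which may differ from the real format.
def load_lemmas(path):
    lemmas = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) == 2:
                form, lemma = parts
                lemmas[form] = lemma
    return lemmas

lemmas = load_lemmas('lemmas/es.txt')  # hypothetical file name
words = 'los gatos negros'.split()
print(' '.join(lemmas.get(w, w) for w in words))  # unknown forms pass through
```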
To clone all submodules, use the following commands:

```
git submodule init
git submodule update
```

Further reference can be found here.
The stopwords and wordnet corpora for the nltk package must be installed. A helper script is provided for easy setup. Simply run:

```
python setup.py
```
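Alternatively, the same corpora can be fetched directly through nltk's standard downloader, independent of the helper script:

```python
import nltk

# Download the corpora used by stopword removal and lemmatization.
nltk.download('stopwords')
nltk.download('wordnet')
```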
Usage example:

```python
from textpreprocess.compound_cleaners.en import full_clean, soft_clean

text = ' thiss is a bery :''{ñdirti text! '
full_clean(text)  # -> 'this very dirt text'
soft_clean(text)  # -> 'this is a very dirty text'
```

Special thanks to Vicente Oyanedel M. for his work on the first version of this package.