Python package for natural language pre-processing with nltk and Hunspell.
Includes:
- Standardizing cases
- Standardizing symbols
- Removing extra whitespace
- Removing stopwords
- Simple spelling correction
- Lemmatization
Available utilities:
- `clean_cases`
- `split_camel_cased`
- `clean_invalid_symbols`
- `clean_repeated_symbols`
- `clean_spaces`
- `remove_stopwords`
- `fix_spelling`
- `SpellChecker`
- `lemmatize`
- `clean`
- `soft_clean`
- `full_clean`
Supported languages:
- Spanish
- English
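As a rough sketch of how the individual utilities compose, the snippet below chains a few of them. The `textpreprocess.cleaners.en` module path and the exact signatures are assumptions inferred from the utility names above, not taken from the package:

```python
# Hypothetical chaining of individual utilities; the module path
# `textpreprocess.cleaners.en` and the signatures are assumptions.
from textpreprocess.cleaners.en import clean_cases, clean_spaces, remove_stopwords

text = '  THIS is   a Sample  text  '
text = clean_cases(text)       # assumed to normalize casing
text = clean_spaces(text)      # assumed to collapse extra whitespace
text = remove_stopwords(text)  # assumed to drop English stopwords
```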
Spell-checking functions rely on Hunspell dictionary files, placed by default in the `dictionaries` directory. This collection of dictionaries was added as a git submodule for convenience.
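For reference on how Hunspell consumes these dictionary files, here is a minimal sketch using the pyhunspell bindings; the binding choice and the dictionary paths are assumptions, not the package's internals:

```python
# Sketch of Hunspell-backed spell checking with pyhunspell; the paths
# below are assumptions and may differ from the package defaults.
import hunspell

checker = hunspell.HunSpell('dictionaries/en_US.dic', 'dictionaries/en_US.aff')

if not checker.spell('bery'):       # True only for correctly spelled words
    print(checker.suggest('bery'))  # e.g. ['very', 'berry', ...]
```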
Lemmatization in Spanish relies on lemma dictionary files, placed by default in the `lemmas` directory. This collection was also added as a git submodule for convenience. Feel free to propose your own!
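As an illustration of how a lemma dictionary drives this step, the sketch below maps word forms to lemmas via a plain lookup; the tab-separated `form<TAB>lemma` file format and the file name are assumptions, not the package's actual layout:

```python
# Minimal dictionary-based lemmatizer; assumes each line of the lemma
# file holds "form<TAB>lemma", which may differ from the real format.
def load_lemmas(path):
    lemmas = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) == 2:
                form, lemma = parts
                lemmas[form] = lemma
    return lemmas

lemmas = load_lemmas('lemmas/es.txt')  # hypothetical file name
words = 'los gatos negros'.split()
print(' '.join(lemmas.get(w, w) for w in words))  # unknown forms pass through
```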
To clone all submodules, use the following commands:

```
git submodule init
git submodule update
```

Further reference can be found here.
The stopwords and wordnet corpora for the nltk package must be installed. A helper script is provided for easy setup. Simply run:

```
python setup.py
```
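Alternatively, the same corpora can be fetched directly through nltk's standard downloader, independent of the helper script:

```python
import nltk

# Download the corpora used by stopword removal and lemmatization.
nltk.download('stopwords')
nltk.download('wordnet')
```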
Usage example:

```python
from textpreprocess.compound_cleaners.en import full_clean, soft_clean

text = ' thiss is a bery :''{ñdirti text! '
full_clean(text)  # -> 'this very dirt text'
soft_clean(text)  # -> 'this is a very dirty text'
```

Special thanks to Vicente Oyanedel M. for his work on the first version of this package.