Table of Contents
- How to use
- Code snippets
- Cleaning text
- Lowercase text
- Remove cases (useful for caseless matching)
- Remove hyperlinks
- Remove <a> tags but keep their content
- Remove HTML tags
- Remove extra spaces, tabs, and line breaks
- Remove punctuation
- Remove numbers
- Remove digits
- Remove non-alphabetic words
- Remove all special characters and punctuation
- Remove stopwords
- Remove short tokens
- Transform emojis to characters
- NLTK
- spaCy
- Keras
Every time I start a new project, I promise myself that I'll save the most useful code snippets for the future. I never keep that promise.
The old ways are too compelling. I end up copying code from old projects, searching for the same questions on Stack Overflow, or reviewing the same Kaggle notebooks for the hundredth time. At this point, I don't know how many times I've googled for a variant of "remove extra spaces in a string using Python."
So, finally, I've decided to compile snippets and small recipes for frequent tasks. I'm starting with Natural Language Processing (NLP) because I've been involved in several projects in that area in the last few years. And because I like it ;)
For now, I'm planning on compiling code snippets and recipes for the following tasks:
- Cleaning and tokenizing text (this article)
- Clustering documents
- Classifying text
This article contains 20 code snippets you can use to clean and tokenize text using Python. I'll continue adding new ones whenever I find something useful. They're based on a mix of Stack Overflow answers, books, and my own experience.
In the next section, you can see an example of how to use the code snippets. Then, you can check the snippets on your own and take the ones that you need.
How to use
I'd recommend combining the snippets you need into a single function. Then, you can use that function for pre-processing or tokenizing text. If you're using pandas, you can apply that function to a specific column using the .map method of pandas' Series.
Take a look at the example below:
import re
import pandas as pd
from string import punctuation
df = pd.DataFrame({
    "text_col": [
        "This TEXT needs \t\t\tsome cleaning!!!...",
        "This text too!!... ",
        "Yes, you got it right!\n This one too\n"
    ]
})

def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)  # Remove punctuation
    text = " ".join(text.split())  # Remove extra spaces, tabs, and new lines
    return text
df["text_col"].map(preprocess_text)
Code snippets
Before testing the snippets, make sure to copy the following function at the top of your Python script or Jupyter notebook.
def print_text(sample, clean):
    print(f"Before: {sample}")
    print(f"After: {clean}")
Cleaning text
These are functions you can use to clean text using Python. Most of them rely only on Python's standard library modules, like re or string.
Lowercase text
sample_text = "THIS TEXT WILL BE LOWERCASED. THIS WON'T: ßßß"
clean_text = sample_text.lower()
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: THIS TEXT WILL BE LOWERCASED. THIS WON'T: ßßß
# After: this text will be lowercased. this won't: ßßß
Remove cases (useful for caseless matching)
sample_text = "THIS TEXT WILL BE LOWERCASED. THIS too: ßßß"
clean_text = sample_text.casefold()
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: THIS TEXT WILL BE LOWERCASED. THIS too: ßßß
# After: this text will be lowercased. this too: ssssss
Remove hyperlinks
import re
sample_text = "Some URLs: https://example.com http://example.io http://exam-ple.com More text"
clean_text = re.sub(r"https?://\S+", "", sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: Some URLs: https://example.com http://example.io http://exam-ple.com More text
# After: Some URLs: More text
Remove <a> tags but keep their content
import re
sample_text = "Here's <a href='https://example.com'> a tag</a>"
clean_text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: Here's <a href='https://example.com'> a tag</a>
# After: Here's a tag
Remove HTML tags
import re
sample_text = """
<body>
<div> This is a sample text with <b>lots of tags</b> </div>
<br/>
</body>
"""
clean_text = re.sub(r"<.*?>", " ", sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before:
# <body>
# <div> This is a sample text with <b>lots of tags</b> </div>
# <br/>
# </body>
# After:
# This is a sample text with lots of tags
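The regex above works fine for simple markup like this example, but it can trip on messier real-world HTML (comments, scripts, attributes containing >). If you need something more robust, an HTML parser is safer. Here's a small sketch using BeautifulSoup, assuming you've installed it with pip install beautifulsoup4:
from bs4 import BeautifulSoup
sample_text = "<body><div> This is a sample text with <b>lots of tags</b> </div><br/></body>"
clean_text = BeautifulSoup(sample_text, "html.parser").get_text(separator=" ")  # keep only the text nodes
print_text(sample_text, clean_text)
# After: roughly " This is a sample text with lots of tags " (tags gone, whitespace preserved)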
Remove extra spaces, tabs, and line breaks
sample_text = " \t\tA text\t\t\t\n\n sample "
clean_text = " ".join(sample_text.split())
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: A text
# sample
# After: A text sample
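If you prefer a regex, collapsing every run of whitespace into a single space should give the same result on this example:
import re
sample_text = " \t\tA text\t\t\t\n\n sample "
clean_text = re.sub(r"\s+", " ", sample_text).strip()  # collapse whitespace runs, trim both ends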
Remove punctuation
import re
from string import punctuation
sample_text = "A lot of !!!! .... ,,,, ;;;;;;;?????"
clean_text = re.sub(f"[{re.escape(punctuation)}]", "", sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: A lot of !!!! .... ,,,, ;;;;;;;?????
# After: A lot of
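An alternative that skips regular expressions entirely is str.translate, which should produce the same result here:
from string import punctuation
sample_text = "A lot of !!!! .... ,,,, ;;;;;;;?????"
clean_text = sample_text.translate(str.maketrans("", "", punctuation))  # drop every punctuation character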
Remove numbers
import re
sample_text = "Remove these numbers: 1919191 2229292 11.233 22/22/22. But don't remove this one H2O"
clean_text = re.sub(r"\b[0-9]+\b\s*", "", sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: Remove these numbers: 1919191 2229292 11.233 22/22/22. But don't remove this one H2O
# After: Remove these numbers: .//. But don't remove this one H2O
Remove digits
sample_text = "I want to keep this one: 10/10/20 but not this one 222333"
clean_text = " ".join([w for w in sample_text.split() if not w.isdigit()]) # Side effect: removes extra spaces
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: I want to keep this one: 10/10/20 but not this one 222333
# After: I want to keep this one: 10/10/20 but not this one
Remove non-alphabetic words
sample_text = "Sample text with numbers 123455 and words"
clean_text = " ".join([w for w in sample_text.split() if w.isalpha()]) # Side effect: removes extra spaces
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: Sample text with numbers 123455 and words
# After: Sample text with numbers and words
Remove all special characters and punctuation
import re
sample_text = "Sample text 123 !!!! Haha.... !!!! ##$$$%%%%"
clean_text = re.sub(r"[^A-Za-z0-9\s]+", "", sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: Sample text 123 !!!! Haha.... !!!! ##$$$%%%%
# After: Sample text 123 Haha
Remove stopwords
stopwords = ["is", "a"]
sample_text = "this is a sample text"
tokens = sample_text.split()
clean_tokens = [t for t in tokens if t not in stopwords]
clean_text = " ".join(clean_tokens)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: this is a sample text
# After: this sample text
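In a real project you'll usually want a full stopword list instead of a hand-written one. A sketch using NLTK's English stopwords (assumes you've run pip install nltk and nltk.download("stopwords")):
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))  # a set makes the membership check fast
sample_text = "this is a sample text"
clean_text = " ".join([t for t in sample_text.split() if t not in stop_words])
# clean_text should come out as "sample text"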
Remove short tokens
sample_text = "this is a sample text. I'll remove the a"
tokens = sample_text.split()
clean_tokens = [t for t in tokens if len(t) > 1]
clean_text = " ".join(clean_tokens)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: this is a sample text. I'll remove the a
# After: this is sample text. I'll remove the
Transform emojis to characters
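This snippet relies on the emoji library, which isn't part of the standard library; you can install it with pip install emoji.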
from emoji import demojize
sample_text = "I love 🥑"
clean_text = demojize(sample_text)
print_text(sample_text, clean_text)
# ----- Expected output -----
# Before: I love 🥑
# After: I love :avocado:
NLTK
Before using NLTK's snippets, you need to install NLTK. You can do that as follows: pip install nltk.
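NLTK's tokenizers also depend on the Punkt model data. If you hit a LookupError when running the snippets below, downloading it usually fixes it (on newer NLTK releases the package may be called punkt_tab instead):
import nltk
nltk.download("punkt")  # tokenizer models used by word_tokenize and sent_tokenize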
Tokenize text using NLTK
from nltk.tokenize import word_tokenize
sample_text = "this is a text ready to tokenize"
tokens = word_tokenize(sample_text)
print_text(sample_text, tokens)
# ----- Expected output -----
# Before: this is a text ready to tokenize
# After: ['this', 'is', 'a', 'text', 'ready', 'to', 'tokenize']
Tokenize tweets using NLTK
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()
sample_text = "This is a tweet @jack #NLP"
tokens = tweet_tokenizer.tokenize(sample_text)
print_text(sample_text, tokens)
# ----- Expected output -----
# Before: This is a tweet @jack #NLP
# After: ['This', 'is', 'a', 'tweet', '@jack', '#NLP']
Split text into sentences using NLTK
from nltk.tokenize import sent_tokenize
sample_text = "This is a sentence. This is another one!\nAnd this is the last one."
sentences = sent_tokenize(sample_text)
print_text(sample_text, sentences)
# ----- Expected output -----
# Before: This is a sentence. This is another one!
# And this is the last one.
# After: ['This is a sentence.', 'This is another one!', 'And this is the last one.']
spaCy
Before using spaCy's snippets, you need to install the library as follows: pip install spacy. You also need to download a language model. For English, here's how you do it: python -m spacy download en_core_web_sm.
Tokenize text using spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
sample_text = "this is a text ready to tokenize"
doc = nlp(sample_text)
tokens = [token.text for token in doc]
print_text(sample_text, tokens)
# ----- Expected output -----
# Before: this is a text ready to tokenize
# After: ['this', 'is', 'a', 'text', 'ready', 'to', 'tokenize']
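A nice bonus of spaCy's tokens is that they carry attributes such as is_stop, is_punct, and is_space, so you can tokenize and clean in a single pass. A minimal sketch, reusing the same en_core_web_sm model:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a text, ready to tokenize!")
tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]  # drop stopwords and punctuation
# tokens should come out as something like ['text', 'ready', 'tokenize']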
Split text into sentences using spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
sample_text = "This is a sentence. This is another one!\nAnd this is the last one."
doc = nlp(sample_text)
sentences = [sentence.text for sentence in doc.sents]
print_text(sample_text, sentences)
# ----- Expected output -----
# Before: This is a sentence. This is another one!
# And this is the last one.
# After: ['This is a sentence.', 'This is another one!\n', 'And this is the last one.']
Keras
Before using Keras' snippets, you need to install the library as follows: pip install tensorflow && pip install keras.
Tokenize text using Keras
from keras.preprocessing.text import text_to_word_sequence
sample_text = 'This is a text you want to tokenize using KERAS!!'
tokens = text_to_word_sequence(sample_text)
print_text(sample_text, tokens)
# ----- Expected output -----
# Before: This is a text you want to tokenize using KERAS!!
# After: ['this', 'is', 'a', 'text', 'you', 'want', 'to', 'tokenize', 'using', 'keras']
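Depending on your TensorFlow/Keras versions, the bare keras import above may fail or be deprecated. In TensorFlow 2.x the same helper is also available through the bundled Keras; treat this as a version-dependent fallback, since keras.preprocessing.text has been removed in Keras 3:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
tokens = text_to_word_sequence("This is a text you want to tokenize using KERAS!!")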