Clustering Documents with OpenAI, LangChain, and HDBSCAN
This article will teach you how to cluster text data using LLM embeddings and cutting-edge tools.
In the past, the most common way to cluster documents was to build vectors using traditional machine learning methods, such as bag-of-words, or smaller pre-trained NLP models, like BERT, and then create groups out of them. But LLMs have changed that.
While older methods are still relevant, if I had to cluster text data today, I'd start with OpenAI's or Cohere's embeddings. They're faster and easier to work with, and pairing them with an LLM gives you additional goodies, such as a fitting title for each cluster.
I haven't seen many tutorials on this topic, so I wrote one. In this tutorial, I'll show you how to cluster news articles using OpenAI embeddings, LangChain, and HDBSCAN.
Let's get to it!
Prerequisites
To make the most of this tutorial, you should be familiar with the following concepts:
- How to cluster text data using traditional ML methods.
- What OpenAI embeddings are
- How HDBSCAN works
In addition, you'll need a free account at News API and an OpenAI account.
Set Up Your Local Environment
- Create a virtual environment using `venv`:

```bash
python3.10 -m venv .venv
```
- Create a `requirements.txt` file that contains the following packages:

```text
hdbscan==0.8.29 ; python_version >= "3.10" and python_version < "4.0"
langchain==0.0.194 ; python_version >= "3.10" and python_version < "4.0"
openai==0.27.8 ; python_version >= "3.10" and python_version < "4.0"
pandas==2.0.2 ; python_version >= "3.10" and python_version < "4.0"
python-dotenv==1.0.0 ; python_version >= "3.10" and python_version < "4.0"
tiktoken==0.4.0 ; python_version >= "3.10" and python_version < "4.0"
newsapi-python==0.2.7 ; python_version >= "3.10" and python_version < "4.0"
notebook==6.5.4 ; python_version >= "3.10" and python_version < "4.0"
```
- Activate the virtual environment and install the packages:

```bash
source .venv/bin/activate
pip3 install -r requirements.txt
```
- Create a file called `.env`, and add your OpenAI and News API keys. Note that LangChain's OpenAI integrations read the OpenAI key from a variable called `OPENAI_API_KEY`:

```text
OPENAI_API_KEY=<your key>
NEWSAPI_API_KEY=<your key>
```
- Create an empty notebook file. You'll work in it for the rest of this tutorial.
Clustering Documents
You should think of the clustering process in four steps:
- Extract the text data you'll cluster. In this case, that's the latest news articles available from News API.
- Generate numerical vector representations of the documents using OpenAI's embedding capabilities.
- Apply a clustering algorithm on the vectors to group the documents.
- Generate a title for each cluster summarizing the articles contained in it.
That's it! Now, you'll see how that looks in practice.
Import the Required Packages
Start by importing the required Python libraries. Copy the following code into your notebook:
```python
import os

import hdbscan
import pandas as pd
from langchain import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from newsapi import NewsApiClient
from dotenv import load_dotenv

load_dotenv()
```
This code imports the libraries you'll use throughout the tutorial. Here's the purpose of each one:
- `os` helps you read the environment variables.
- `hdbscan` gives you a wrapper of HDBSCAN, the clustering algorithm you'll use to group the documents.
- `pandas` helps you organize the articles and their cluster assignments into a DataFrame.
- `langchain` provides you with a simple interface to interact with the OpenAI API.
- `newsapi` makes it easy to interact with News API.
- `dotenv` loads the environment variables you define in `.env`.
Next, you'll get a sample of news articles to cluster.
Get the Latest News Articles
You'll use the following code to get the latest 200 news articles using News API. Copy it to your notebook, and run the cell.
```python
newsapi = NewsApiClient(api_key=os.getenv("NEWSAPI_API_KEY"))

sources_1 = [
    "the-washington-post",
    "the-wall-street-journal",
    "business-insider",
]
sources_2 = [
    "associated-press",
    "bloomberg",
]

recent_articles = []

for source in [sources_1, sources_2]:
    recent_articles.extend(newsapi.get_everything(
        sources=",".join(source),
        language="en",
        page_size=100,
    )["articles"])
```
This code works as follows:
- Line 1 initializes the News API client.
- Lines 3 to 11 define the sources you'll use to find news articles. You need to split them into two groups because the API limits you to 100 articles per request. So, in this case, you'll get 200 articles (100 for each source group).
- Lines 15 to 20 use the client to get the articles from the sources you defined earlier.
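Before moving on, it doesn't hurt to sanity-check the response. This quick snippet (an optional addition, not part of the original pipeline) prints how many articles you fetched and the fields available on each one:

```python
# Each article is a dict parsed from the News API JSON response;
# you'll use its "title" and "description" fields later on.
print(len(recent_articles))       # should be 200
print(recent_articles[0].keys())  # includes "title" and "description"
```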
Next, you'll generate embeddings and use them to cluster the articles.
Generate Embeddings and Cluster Documents
First, you'll generate embeddings for each of the news articles. With LangChain, that only takes a couple of lines of code:
```python
docs = [
    a["title"] + "\n\n" + a["description"]
    for a in recent_articles
]

embeddings = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs)
```
This code provides a numerical vector (embedding) for each of the news articles. It works as follows:
- Lines 1 to 4 extract the title and description of each article, which is the text you'll generate the embeddings from.
- Line 6 initializes `OpenAIEmbeddings` and generates the vectors. In this case, you'll use `text-embedding-ada-002`, the default model for this task, and set `chunk_size` to 1000, which limits how many texts are sent to the embeddings API in each batch.
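If you want to double-check the output, a quick way (again, an optional addition) is to confirm you got one vector per document and that each vector has 1,536 dimensions, the output size of `text-embedding-ada-002`:

```python
# One embedding per document, each a vector of 1,536 floats.
assert len(embeddings) == len(docs)
print(len(embeddings[0]))  # 1536
```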
Once you have the embeddings, you can cluster them using `hdbscan`:
```python
hdb = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=3).fit(embeddings)

df = pd.DataFrame({
    "title": [article["title"] for article in recent_articles],
    "description": [article["description"] for article in recent_articles],
    "cluster": hdb.labels_,
})
df = df.query("cluster != -1").copy()  # Remove documents that are not in a cluster
```
This code generates clusters from the embeddings and then creates a DataFrame with the results. It works as follows:
- Line 1 fits the `hdbscan` algorithm. In this case, I set `min_samples` and `min_cluster_size` to 3, but the right values depend on your data. Check HDBSCAN's documentation to learn more about these parameters.
- Lines 3 to 8 create a DataFrame with the titles, descriptions, and corresponding cluster assignments of all the articles. Because HDBSCAN doesn't necessarily assign every document to a cluster, you exclude the ones that weren't assigned (denoted by a -1 label).
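To get a feel for how these parameters behaved on your sample, you can inspect the cluster sizes before filtering. This optional snippet counts the articles per label, where -1 is the "noise" group HDBSCAN couldn't assign:

```python
# Number of articles per cluster label; -1 groups the unassigned articles.
print(pd.Series(hdb.labels_).value_counts().sort_index())
```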
Next, you'll create topic titles for each cluster based on their contents.
Create a Topic Title per Cluster
For each cluster, you'll generate a topic title summarizing the articles in that cluster. Copy the following code to your notebook:
```python
def get_prompt():
    system_template = "You're an expert journalist. You're helping me write a compelling topic title for news articles."
    human_template = "Using the following articles, write a topic title that summarizes them.\n\nARTICLES:{articles}\n\nTOPIC TITLE:"
    return ChatPromptTemplate(
        messages=[
            SystemMessagePromptTemplate.from_template(system_template),
            HumanMessagePromptTemplate.from_template(human_template),
        ],
        input_variables=["articles"],
    )


for c in df.cluster.unique():
    chain = LLMChain(
        llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"), prompt=get_prompt(), verbose=False
    )
    articles_str = "\n".join(
        [
            f"{article['title']}\n{article['description']}\n"
            for article in df.query(f"cluster == {c}").to_dict(orient="records")
        ]
    )
    result = chain.run(
        {
            "articles": articles_str,
        }
    )
    df.loc[df.cluster == c, "topic_title"] = result
```
This code takes the articles in each cluster and uses `gpt-3.5-turbo` to generate a relevant topic title for them. It works as follows:
- Lines 1 to 11 define `get_prompt`, a function that builds the prompt sent in each request to `gpt-3.5-turbo`.
- Lines 13 to 28 go through each cluster, take the articles in it, and, using the prompt defined earlier, generate a topic title for that cluster.
Finally, you can check the resulting clusters and topic titles, as follows:
```python
c = 6

with pd.option_context("display.max_colwidth", None):
    print(df.query(f"cluster == {c}").topic_title.values[0])
    display(df.query(f"cluster == {c}").drop(columns=["topic_title"]).head())
```
In my case, running this code produces the following articles and topic titles:

All articles seem to be related to different aspects of the impact of AI, and the title seems like a good fit for that. Yay!
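If you'd rather scan all the topics at once instead of checking one cluster at a time, a one-liner like this (not in the original code) does the trick:

```python
# One row per cluster, showing its generated topic title.
print(df[["cluster", "topic_title"]].drop_duplicates().to_string(index=False))
```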
Conclusion
In this short tutorial, you've learned how to cluster documents using OpenAI embeddings, LangChain, and HDBSCAN.
I hope you find this useful. Let me know in the comments if you have any questions.
Check out the code on GitHub.