Clustering Documents with OpenAI, LangChain, and HDBSCAN

This article will teach you how to cluster text data with LLMs using cutting-edge tools.


In the past, the most common way to cluster documents was to build vectors with traditional machine learning methods such as bag-of-words, or with smaller pre-trained NLP models like BERT, and then group those vectors. But LLMs have changed that.

While older methods are still relevant, if I had to cluster text data today, I'd start with OpenAI's or Cohere's embeddings. They're faster, easier to use, and give you additional goodies, such as generating a fitting title for each cluster.

I haven't seen many tutorials on this topic, so I wrote one. In this tutorial, I'll show you how to cluster news articles using OpenAI embeddings, LangChain, and HDBSCAN.

Let's get to it!

Prerequisites

To make the most of this tutorial, you should be familiar with Python and have a basic understanding of embeddings and clustering.

In addition, you'll need a free account at News API and an OpenAI account.

Set Up Your Local Environment

  1. Create a virtual environment using venv:
    python3.10 -m venv .venv
    
  2. Create a requirements.txt file that contains the following packages:
    hdbscan==0.8.29 ; python_version >= "3.10" and python_version < "4.0"
    langchain==0.0.194 ; python_version >= "3.10" and python_version < "4.0"
    openai==0.27.8 ; python_version >= "3.10" and python_version < "4.0"
    pandas==2.0.2 ; python_version >= "3.10" and python_version < "4.0"
    python-dotenv==1.0.0 ; python_version >= "3.10" and python_version < "4.0"
    tiktoken==0.4.0 ; python_version >= "3.10" and python_version < "4.0"
    newsapi-python==0.2.7 ; python_version >= "3.10" and python_version < "4.0"
    notebook==6.5.4 ; python_version >= "3.10" and python_version < "4.0" 
    
  3. Activate the virtual environment and install the packages:
    source .venv/bin/activate
    pip3 install -r requirements.txt
    
  4. Create a file called .env, and add your OpenAI and News API keys:
    OPENAI_API_KEY=<your key>
    NEWSAPI_API_KEY=<your key>
    
  5. Create an empty notebook file. For the rest of this tutorial, you'll work on it.

Clustering Documents

You can think of the clustering process as four steps:

  1. Extract the text data you'll cluster. In our case, that's the latest news articles available through News API.
  2. Generate numerical vector representations of the documents using OpenAI's embedding capabilities.
  3. Apply a clustering algorithm on the vectors to group the documents.
  4. Generate a title for each cluster summarizing the articles contained in it.

That's it! Now, you'll see how that looks in practice.

Import the Required Packages

Start by importing the required Python libraries. Copy the following code in your notebook:

import os

import hdbscan
import pandas as pd

from langchain import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from newsapi import NewsApiClient

from dotenv import load_dotenv

load_dotenv()

This code imports the libraries you'll use throughout the tutorial. Here's the purpose of each one:

  • os helps you read the environment variables.
  • hdbscan gives you a wrapper of HDBSCAN, the clustering algorithm you'll use to group the documents.
  • langchain provides you with a simple interface to interact with the OpenAI API.
  • newsapi makes it easy to interact with News API.
  • dotenv loads the environment variables you define in .env.

Next, you'll get a sample of news articles to cluster.

Get the Latest News Articles

You'll use the following code to get the latest 200 news articles using News API. Copy it to your notebook, and run the cell.

newsapi = NewsApiClient(api_key=os.getenv("NEWSAPI_API_KEY"))

sources_1 = [
    "the-washington-post",
    "the-wall-street-journal",
    "business-insider",
]
sources_2 = [
    "associated-press",
    "bloomberg",
]

recent_articles = []

for source in [sources_1, sources_2]:
    recent_articles.extend(newsapi.get_everything(
        sources=",".join(source),
        language="en",
        page_size=100
    )["articles"])

This code works as follows:

  • Line 1 initializes the News API client.
  • Lines 3 to 11 define the sources you'll use to find news articles. You need to split them into two groups because the API limits you to 100 articles per request, so in this case, you'll get 200 articles (100 for each source group).
  • Lines 15 to 20 use the client to get the articles with the sources you defined earlier.
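
If you want to confirm the request worked before moving on, a quick, optional check is to print how many articles came back and peek at the first one:

print(len(recent_articles))               # should be up to 200 articles
print(recent_articles[0]["title"])        # headline of the first article
print(recent_articles[0]["description"])  # its short description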

Next, you'll generate embeddings and use them to cluster the articles.

Generate Embeddings and Cluster Documents

First, you'll start by generating embeddings from each of the news articles. With LangChain, that only takes a couple of lines of code:

docs = [
    a["title"] + "\n\n" + a["description"]
    for a in recent_articles
]

embeddings = OpenAIEmbeddings(chunk_size=1000).embed_documents(docs)

This code provides a numerical vector (embedding) for each of the news articles. It works as follows:

  • Lines 1 to 4 extract the title and description from the articles for which you'll generate the embeddings.
  • Line 6 initializes OpenAIEmbeddings and generates the vectors. In this case, you'll use text-embedding-ada-002, which is the default model for this task, and set chunk_size to 1000, which limits how many documents are sent to the API in each batch.
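
As an optional sanity check, you can confirm that you got one vector per document and that each vector has the expected dimensionality (text-embedding-ada-002 produces 1,536-dimensional embeddings):

print(len(docs), len(embeddings))   # one embedding per document
print(len(embeddings[0]))           # 1536 dimensions per embedding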

Once you have the embeddings, you can cluster them using hdbscan:

hdb = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=3).fit(embeddings)

df = pd.DataFrame({
    "title": [article["title"] for article in recent_articles],
    "description": [article["description"] for article in recent_articles],
    "cluster": hdb.labels_,
})
df = df.query("cluster != -1")  # Remove documents that were not assigned to a cluster

This code clusters the embeddings you just generated and then creates a DataFrame with the results. It works as follows:

  • Line 1 fits the hdbscan algorithm. In this case, I set min_samples and min_cluster_size to 3, but depending on your data this may change. Check HDBSCAN's documentation to learn more about these parameters.
  • Lines 3 to 8 create a DataFrame with the titles, descriptions, and corresponding cluster assignments of all the articles. Note that because HDBSCAN doesn't necessarily assign a cluster for each observation, you exclude the ones that weren't assigned to a cluster (denoted by a -1).
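
Before moving on, it can help to see how many clusters HDBSCAN found and how many articles ended up in each one. A short, optional check using the DataFrame above:

print(df.cluster.nunique(), "clusters")  # number of clusters (noise already removed)
print(df.cluster.value_counts())         # number of articles per cluster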

Next, you'll create topic titles for each cluster based on their contents.

Create a Topic Title per Cluster

For each cluster, you'll generate a topic title summarizing the articles in that cluster. Copy the following code to your notebook:

def get_prompt():
    system_template = "You're an expert journalist. You're helping me write a compelling topic title for news articles."
    human_template = "Using the following articles, write a topic title that summarizes them.\n\nARTICLES:{articles}\n\nTOPIC TITLE:"

    return ChatPromptTemplate(
        messages=[
            SystemMessagePromptTemplate.from_template(system_template),
            HumanMessagePromptTemplate.from_template(human_template),
        ],
        input_variables=["articles"],
    )

for c in df.cluster.unique():
    chain = LLMChain(
        llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"), prompt=get_prompt(), verbose=False
    )
    articles_str = "\n".join(
        [
            f"{article['title']}\n{article['description']}\n"
            for article in df.query(f"cluster == {c}").to_dict(orient="records")
        ]
    )
    result = chain.run(
        {
            "articles": articles_str,
        }
    )
    df.loc[df.cluster == c, "topic_title"] = result

This code takes all the articles per cluster and uses gpt-3.5-turbo to generate a relevant topic title from them.

  • Lines 1 to 11 define get_prompt, a function that builds the prompt used in the request to gpt-3.5-turbo.
  • Lines 13 to 28 go through each cluster, take the articles in it, and, using the prompt defined earlier, generate a topic title for that cluster.

Finally, you can check the resulting clusters and topic titles, as follows:

c = 6
with pd.option_context("display.max_colwidth", None):
    print(df.query(f"cluster == {c}").topic_title.values[0])
    display(df.query(f"cluster == {c}").drop(columns=["topic_title"]).head())

In my case, running this code produces the following articles and topic title:

All the articles are related to different aspects of the impact of AI, and the title seems like a good fit. Yay!
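
If you'd rather see every cluster's title at once instead of inspecting them one by one, a quick option is to deduplicate the topic_title column:

with pd.option_context("display.max_colwidth", None):
    display(df[["cluster", "topic_title"]].drop_duplicates().sort_values("cluster"))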

Conclusion

In this short tutorial, you've learned how to cluster documents using OpenAI embeddings, LangChain, and HDBSCAN.

I hope you find this useful. Let me know in the comments if you have any questions.

Check out the code on GitHub.