Map BERTopic topic IDs back to the training dataframe


I have trained a BERTopic model on a dataframe with 400k rows. I want to map the topic of each document to a new column in the dataframe. I could do that by looping over all the documents and calling topic_model.transform(doc) on each one. The only problem is that it takes more than a second to transform each document into its topic, which would take days for the whole dataset.

Is there a way to achieve this faster, given that I only want to map the topics back onto the training data?

I tried:

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)

topics = []
for text in df.texts:
    tops = topic_model.transform(text)  # slow: one transform() call per document; returns a (topics, probs) tuple
    topics.append(tops)
df['topics'] = topics

CodePudding user response:

There is no need to recalculate the topics, as you already retrieved them when using .fit_transform. The topics you get back there are in the exact same order as the input documents, so you can simply do the following:

import pandas as pd
from bertopic import BERTopic

# The `topics` that you get here are in the exact same order as `docs`:
# `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)

# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})
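
One caveat: depending on your BERTopic version, reduce_topics may update the assignments stored on the model itself rather than the list you got back earlier from .fit_transform. If you reduce topics after fitting, a safer sketch is to pull the refreshed assignments from the model:

# After reduce_topics, the refreshed assignments live on the model
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})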

For those using .fit instead of .fit_transform, you can also access the topic assignments for your documents as follows:

# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
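
If you want human-readable labels rather than bare topic IDs, a small sketch (assuming the get_topic_info() overview table, which in current BERTopic versions includes Topic and Name columns):

# Merge readable topic names onto the per-document assignments
topic_info = topic_model.get_topic_info()  # columns: Topic, Count, Name, ...
df = df.merge(topic_info[["Topic", "Name"]], on="Topic", how="left")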

CodePudding user response:

From the source code, the transform() function of the BERTopic class accepts a list of documents, so you don't need to loop over your dataframe calling transform() once per document.

Secondly, if you don't pass your pre-computed document embeddings to transform(), embeddings defaults to None and _extract_embeddings() is called internally on every call, which is likely what is causing the poor performance. The solution is to pass the embeddings to your transform() call. In the dummy example shown below, this improves the speed of classifying 1,000 documents by approx. 1,555x (68.43 vs. 0.044 seconds).

Example

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import random
import pandas as pd

# Create dummy data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
random.seed(756)
training_docs = random.sample(docs, 1000)
testing_docs = random.sample(docs, 1000)

# Instantiate and fit topic model to training docs
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
train_embeddings = sentence_model.encode(training_docs, show_progress_bar=True)
topic_model = BERTopic().fit(training_docs, train_embeddings)
topic_model.reduce_topics(training_docs, nr_topics=5)  # Reduce number of topics; default nr_topics = 20

# Determine topics on testing docs, passing their pre-computed embeddings
test_embeddings = sentence_model.encode(testing_docs, show_progress_bar=True)
topics, probs = topic_model.transform(testing_docs, test_embeddings)
# topics, probs = topic_model.transform(testing_docs)  # ~1,555x slower: re-embeds internally
df = pd.DataFrame({"docs": testing_docs, "topics": topics})
print(df)
print(topic_model.get_topic_info())
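
To sanity-check the speedup on your own hardware (absolute numbers will differ), a minimal timing sketch re-using the test_embeddings computed above:

import time

start = time.perf_counter()
topic_model.transform(testing_docs, test_embeddings)
print(f"With pre-computed embeddings: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
topic_model.transform(testing_docs)  # embeds the documents internally first
print(f"Without embeddings: {time.perf_counter() - start:.3f}s")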