I created a TF-IDF code to analyze an annual report, I want to know the importance of specific keywo

Time:05-15

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the whole annual report as a single document
with open(r'C:\Users\maxim\PycharmProjects\THESIS\data\santander2020_1.txt', 'r') as file:
    data = file.read()

dataset = [data]

# Keep the 100 most frequent 1- to 3-grams, ignoring English stop words
tfIdfVectorizer = TfidfVectorizer(use_idf=True, stop_words="english",
                                  lowercase=True, max_features=100, ngram_range=(1, 3))
tfIdf = tfIdfVectorizer.fit_transform(dataset)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)

print(df.head(25))

The code above is what I've written to run a TF-IDF analysis on an annual report. Currently it gives me the values of the most important words in the report. However, I only need the TF-IDF values for the keywords ["digital","hardware","innovation","software","analytics","data","digitalisation","technology"]. Is there a way to look up the TF-IDF values of just these terms?

I'm very new to programming with little experience, I'm doing this for my thesis.

Any help is greatly appreciated.

CodePudding user response:

You have defined tfIdf as tfIdf = tfIdfVectorizer.fit_transform(dataset).

So tfIdf.toarray() is a 2-D array in which each row corresponds to a document and each element in the row is the TF-IDF score of the corresponding term. To find out which term each element represents, call .get_feature_names_out() on the vectorizer (.get_feature_names() in scikit-learn versions before 1.0), which returns the terms in column order. You can then build a mapping (dict) from terms to scores, like this:

wordScores = dict(zip(tfIdfVectorizer.get_feature_names_out(), tfIdf.toarray()[0]))

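Now, if your document contains the word "digital" and you want its TF-IDF score, simply print wordScores["digital"]. A keyword that is not in the vectorizer's vocabulary (for example, because max_features=100 dropped it) raises a KeyError, so wordScores.get("digital", 0.0) is the safer lookup.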
