Retrieve the matching TFIDF of each word by sentence from a TFIDF matrix (pandas)

My first dataframe contains tokenized sentences; the second is a matrix holding the TFIDF of each word in each sentence.

I'm trying to create a new column that stores only the TFIDF values of the words that appear in each sentence. How can I do it?

Tokenized sentences table

Index	Tokenized_string
1	[word1,word2,word3]
2	[word1,word3,word4]

Tfidf Table

Index	Word1	Word2	...
1	0.03	0.06	...
2	0.5	0.5	...

The table I'm trying to create

Index	Tokenized_string	TFIDF of each word
1	[word1,word2,word3]	[0.03,0.06,0.1]
2	[word1,word3,word4]	[0.5,0.4,0.2]

To create the dataframes in my example:

import pandas as pd
df = pd.DataFrame({ 'Tokenized_string': 
                   [['word1','word2','word3'],
                    ['word1','word3','word4']]
                   })
    
df_2 = pd.DataFrame({ 'Tokenized_string': 
                   [['word1','word2','word3'],
                    ['word1','word3','word4']],
                   'TFIDF of each word':
                       [[0.03,0.06,0.1],
                        [0.5,0.4,0.2]]})

User response:

You can do that as follows, using this tfidf_df as an example:

tfidf_df = pd.DataFrame({
    'Word1': [0.03, 0.5],
    'Word2': [0.06, 0.5],
    'Word3': [0.04, 0.5]
                   })

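Note that you may need to rename the tfidf_df variable to match your naming scheme. The first line below collects every column's value for each row, in sorted column order; it does not filter by the tokens in each sentence.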

tfidf_df['TFIDF of each word'] = tfidf_df[sorted(tfidf_df.columns)].values.tolist()
df_2 = pd.concat([df, tfidf_df["TFIDF of each word"]], axis=1)

print(df_2)
        Tokenized_string  TFIDF of each word
0  [word1, word2, word3]  [0.03, 0.06, 0.04]
1  [word1, word3, word4]     [0.5, 0.5, 0.5]
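
If the goal is to keep only the scores of the words that actually occur in each sentence, a per-token lookup may be closer to what the question asks for. The following is a minimal sketch, not a definitive solution. It assumes both dataframes share the same index, and that the TFIDF column names match the tokens once lowercased. The helper name lookup_scores and the 0.0 entries are hypothetical, made up for illustration; the other values are taken from the tables in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Tokenized_string': [['word1', 'word2', 'word3'],
                         ['word1', 'word3', 'word4']]
})

# Illustrative TFIDF matrix: one row per sentence, one column per word.
# The 0.0 entries are assumed values for words absent from a sentence.
tfidf_df = pd.DataFrame({
    'Word1': [0.03, 0.5],
    'Word2': [0.06, 0.5],
    'Word3': [0.1, 0.4],
    'Word4': [0.0, 0.2],
})

# Assumption: column names differ from the tokens only by case.
tfidf_df.columns = tfidf_df.columns.str.lower()

def lookup_scores(idx, tokens):
    """Return the TFIDF of each token for the sentence at index idx (hypothetical helper)."""
    row = tfidf_df.loc[idx]
    # Tokens without a matching column get NaN instead of raising KeyError.
    return [float(row.get(token, float('nan'))) for token in tokens]

# Build the new column sentence by sentence, preserving token order.
df['TFIDF of each word'] = [
    lookup_scores(idx, tokens)
    for idx, tokens in df['Tokenized_string'].items()
]

print(df)
```

With these values, the new column holds [0.03, 0.06, 0.1] for the first sentence and [0.5, 0.4, 0.2] for the second, which matches the target table in the question. A plain Python loop is used here rather than DataFrame.apply to keep the result a column of lists; for large corpora a vectorized approach may be preferable.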