I have a dataframe:
business049.txt [bmw, cash, fuel, mini, product, less, mini]
business470.txt [saudi, investor, pick, savoy, london, famou]
business075.txt [eu, minist, mull, jet, fuel, tax, european]
business101.txt [australia, rate, australia, rais, benchmark]
business060.txt [insur, boss, plead, guilti, anoth, us, insur]
Therefore, I would like the output to include a column of words and a column of filenames that contain it. It should be like:
bmw [business049.txt,business055.txt]
australia [business101.txt,business141.txt]
Thank you
CodePudding user response:
This is quite possibly not the most efficient/best way to do this, but here you go:
# Create DataFrame from question
df = pd.DataFrame({
'txt_file': ['business049.txt',
'business470.txt',
'business075.txt',
'business101.txt',
'business060.txt',
],
'words': [
['bmw', 'cash', 'fuel', 'mini', 'product', 'less', 'mini'],
['saudi', 'investor', 'pick', 'savoy', 'london', 'famou'],
['eu', 'minist', 'mull', 'jet', 'fuel', 'tax', 'european'],
['australia', 'rate', 'australia', 'rais', 'benchmark'],
['insur', 'boss', 'plead', 'guilti', 'anoth', 'us', 'insur'],
]
})
# Get all unique words in a list
word_list = list(set(df['words'].explode()))
# Link txt files to unique words
# Note: list of txt files is one string comma separated to ensure single column in resulting DataFrame
word_dict = {
unique_word: [', '.join(df[df['words'].apply(lambda list_of_words: unique_word in list_of_words)]['txt_file'])] for unique_word in word_list
}
# Create DataFrame from dictionary (transpose to have words as row index).
words_in_files = pd.DataFrame(word_dict).transpose()
The dictionary word_dict
might already be exactly what you need instead of holding on to a DataFrame just for the sake of using a DataFrame. If that is the case, remove the ', '.join()
part from the dictionary creation, because it doesn't matter that the values of your dict are unequal in length.