Returns a dataframe with a list of files containing the word-CodePudding

I have a dataframe:

business049.txt  [bmw, cash, fuel, mini, product, less, mini]
business470.txt  [saudi, investor, pick, savoy, london, famou]
business075.txt  [eu, minist, mull, jet, fuel, tax, european]
business101.txt  [australia, rate, australia, rais, benchmark]
business060.txt  [insur, boss, plead, guilti, anoth, us, insur]

Therefore, I would like the output to include a column of words and a column of filenames that contain it. It should be like:

bmw          [business049.txt,business055.txt]
australia    [business101.txt,business141.txt]

Thank you

CodePudding user response：

This is quite possibly not the most efficient/best way to do this, but here you go:

# Create DataFrame from question
df = pd.DataFrame({
    'txt_file': ['business049.txt',
            'business470.txt',
            'business075.txt',
            'business101.txt',
            'business060.txt',
            ],
    'words': [
        ['bmw', 'cash', 'fuel', 'mini', 'product', 'less', 'mini'],
        ['saudi', 'investor', 'pick', 'savoy', 'london', 'famou'],
        ['eu', 'minist', 'mull', 'jet', 'fuel', 'tax', 'european'],
        ['australia', 'rate', 'australia', 'rais', 'benchmark'],
        ['insur', 'boss', 'plead', 'guilti', 'anoth', 'us', 'insur'],
    ]
})

# Get all unique words in a list
word_list = list(set(df['words'].explode()))

# Link txt files to unique words
# Note: list of txt files is one string comma separated to ensure single column in resulting DataFrame
word_dict = {
    unique_word: [', '.join(df[df['words'].apply(lambda list_of_words: unique_word in list_of_words)]['txt_file'])] for unique_word in word_list
}

# Create DataFrame from dictionary (transpose to have words as row index).
words_in_files = pd.DataFrame(word_dict).transpose()

The dictionary word_dict might already be exactly what you need instead of holding on to a DataFrame just for the sake of using a DataFrame. If that is the case, remove the ', '.join() part from the dictionary creation, because it doesn't matter that the values of your dict are unequal in length.