How to convert a dataframe to a dictionary when dataframe column has commas-CodePudding

I have a csv as follows:

I need the "Term"s and the "DocID"s for which the "DocFreq" is greater than 5. And I need to store it as a dictionary where the Term is the key and the "DocID"s separated by the comma make individual values for that key in a list.

For example, I need

{"Want to be with":[doc100.txt,doc8311.txt,...doc123.txt], "and has her own": [doc100.txt,doc9286.txt...doc23330.txt]....}

So far, I've got this:

df1 = df[(df['DocFreq'] > 5)][['Term','DocFreq','Ngram','DocID']]

But I can't get the format I need. Doing df.to_dict() gives me a dictionary of dictionaries that include column names and I don't want that.

Please help!! Thank you!!

CodePudding user response：

I would do something like that:

data = pd.DataFrame({'Term': ['a', 'b', 'c', 'd', 'e'], 
 'DocFreq': [1, 2, 2, 6, 7], 
 'Ngram': [4, 4, 4, 4, 4], 
 'DocId': ['11, 123, 222', '22', '33', '44, 303, doa0', '55, 9393, idid']
})
 

filt = data[data['DocFreq'] > 5][['Term', 'DocId']]
result = {row['Term']: row['DocId'] for _, row in filt.iterrows()}

CodePudding user response：

You are almost there. Just select DocID column before calling to_dict. You may use

dict1 = df.loc[(df['DocFreq'] > 5), ['Term','DocID']].set_index('Term')['DocID'].to_dict()