I need the "Term"s and the "DocID"s for which the "DocFreq" is greater than 5. And I need to store it as a dictionary where the Term is the key and the "DocID"s separated by the comma make individual values for that key in a list.
For example, I need
{"Want to be with":[doc100.txt,doc8311.txt,...doc123.txt], "and has her own": [doc100.txt,doc9286.txt...doc23330.txt]....}
So far, I've got this:
df1 = df[(df['DocFreq'] > 5)][['Term','DocFreq','Ngram','DocID']]
But I can't get the format I need. Doing df.to_dict() gives me a dictionary of dictionaries that include column names and I don't want that.
Please help!! Thank you!!
CodePudding user response:
I would do something like that:
data = pd.DataFrame({'Term': ['a', 'b', 'c', 'd', 'e'],
'DocFreq': [1, 2, 2, 6, 7],
'Ngram': [4, 4, 4, 4, 4],
'DocId': ['11, 123, 222', '22', '33', '44, 303, doa0', '55, 9393, idid']
})
filt = data[data['DocFreq'] > 5][['Term', 'DocId']]
result = {row['Term']: row['DocId'] for _, row in filt.iterrows()}
CodePudding user response:
You are almost there. Just select DocID column before calling to_dict
.
You may use
dict1 = df.loc[(df['DocFreq'] > 5), ['Term','DocID']].set_index('Term')['DocID'].to_dict()