I am working on an NLP assignment and having some problems removing duplicated strings from a pandas column.
The data I am using is tagged, so some rows are repeated because the same comment can have multiple tags. So I grouped the data by ID and comment and aggregated the tags, like so:
docs = docs.groupby(['ID2', 'comment']).agg({'tags':', '.join})
After grouping the data, the tags column contained the same tag more than once in some rows. I have tried to remove the repeated tags so that each tag appears only once, but have not been successful. First, I tried
from collections import OrderedDict

docs['new_tags'] = (docs['tags'].str.split()
                    .apply(lambda x: OrderedDict.fromkeys(x).keys())
                    .str.join(' '))
but it did not remove the duplicated tags. So I tried a simple function to get the unique tags, but that was also not successful. The function is below:
def remove_multiples(txt):
    tags = list()
    for t in txt.split():
        if t not in tags:
            tags.append(t)
    return ' '.join(tags)

docs['new_tags'] = docs['tags'].map(remove_multiples)
Sample data is below:
{'ID2': {0: '440', 1: '440', 2: '440', 3: '440', 4: '422', 5: '422', 6: '422',
7: '422', 8: '422', 9: '422', 10: '422', 11: '422', 12: '422', 13: '422', 14: '422',
15: '422', 16: '422', 17: '422', 18: '422', 19: '422', 20: '422', 21: '422', 22: '422'},
'comment': {0: 'prompt', 1: 'prompt', 2: 'prompt', 3: 'prompt', 4: 'prompt',
5: 'prompt', 6: 'prompt', 7: 'great service', 8: 'great service', 9: 'great service',
10: 'friendly', 11: 'friendly', 12: 'friendly', 13: 'friendly', 14: 'fairly organized',
15: 'fairly organized', 16: 'fairly organized', 17: 'fairly organized',
18: 'fairly organized', 19: 'fairly organized', 20: 'fairly organized',
21: 'fairly organized', 22: 'fairly organized'},
'tags': {0: 'sp', 1: 'sp', 2: 'in', 3: 'ps', 4: 'wr', 5: 'sa', 6: 'sa', 7: 'sp',
8: 'gs', 9: 'po', 10: 'av', 11: 'hf', 12: 'cs', 13: 'fr', 14: 'gs', 15: 'ly',
16: 'drt', 17: 'co', 18: 'sp', 19: 'na', 20: 'ps', 21: 'ti', 22: 'ti'}}
Answer:
Is this what you want? Calling unique() on each group's tags before joining drops the repeats:
docs = (
    docs.groupby(['ID2', 'comment'], as_index=False)
        .agg({'tags': lambda tags: ', '.join(tags.unique())})
)
>>> docs
   ID2           comment                             tags
0  422  fairly organized  gs, ly, drt, co, sp, na, ps, ti
1  422          friendly                   av, hf, cs, fr
2  422     great service                       sp, gs, po
3  422            prompt                           wr, sa
4  440            prompt                       sp, in, ps
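
If the tags have already been joined and you would rather clean the strings than regroup, here is a minimal sketch. One likely reason remove_multiples fell short is that the tags were joined with ', ' but split on whitespace, so every tag except the last keeps its trailing comma and 'sa,' is not seen as equal to 'sa'. The helper name unique_tags and its sep parameter are made up for illustration, and the sample strings are built from the tags in the question.

```python
def unique_tags(txt, sep=', '):
    """Drop repeated tags from a sep-joined string, keeping first-seen order."""
    # Split on the same separator that was used for joining, so no token
    # carries a stray comma. dict.fromkeys keeps insertion order (Python 3.7+).
    return sep.join(dict.fromkeys(txt.split(sep)))


# Stand-ins for two values of the grouped tags column
joined = ['sp, sp, in, ps', 'wr, sa, sa']

# A whitespace split leaves the commas attached to the tokens
tokens = 'wr, sa, sa'.split()
print(tokens)   # ['wr,', 'sa,', 'sa']

cleaned = [unique_tags(t) for t in joined]
print(cleaned)  # ['sp, in, ps', 'wr, sa']
```

On the DataFrame this would be applied the same way as in the question, e.g. docs['new_tags'] = docs['tags'].map(unique_tags). Deduplicating before the join, as in the groupby above, avoids the string round trip altogether.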