Sample dataset I have looks like this:
Language | Count |
---|---|
Russian | 1000 |
English | 1500 |
Spanish | 500 |
Arabic,Russian, English, Spanish | 2 |
Arabic, English | 15 |
I want it to transform so that the result looks like this:
Language | Count |
---|---|
Russian | 1002 |
English | 1517 |
Spanish | 502 |
Arabic | 17 |
So what happened is that, I parsed rows that contained more than one language. Added up them to languages that were already given. If it was not given (in this case: Arabic) created the new one.
How can i achieve this? Thank you!
CodePudding user response:
Use DataFrame.assign
with Series.str.split
, DataFrame.explode
and last aggregate sum:
df = (df.assign(Language=df.Language.str.split(','))
.explode('Language')
.groupby('Language', as_index=False, sort=False)
.sum())
print (df)
Language Count
0 Russian 1002
1 English 1517
2 Spanish 502
3 Arabic 17
CodePudding user response:
You can split
, explode
, and groupby.sum
:
out = (df
.assign(Language=df['Language'].str.split(',\s*'))
.explode('Language')
.groupby('Language', as_index=False, sort=False).sum()
)
output:
Language Count
0 Russian 1002
1 English 1517
2 Spanish 502
3 Arabic 17