I have a dataframe with an ID column, a TEXT column, and a TOKEN column holding the 3 most frequent words in TEXT:
ID TEXT TOKEN
sentence1 Emma Woodhouse , handsome , clever , and rich ... [(emma, 2), (woodhouse, 2), (handsome, 1)]
sentence2 She was the youngest of the two daughters of a... [(youngest, 1), (two, 1), (daughters, 2)]
sentence3 Her mother had died too long ago for her to ha... [(mother, 2), (died, 1), (long, 1)]
I want to turn each element of the TOKEN column into its own row in a new dataframe. I have tried many ways, but I am not able to get the token elements out of their column. The expected output would look like this:
WORD FREQ ID TEXT
emma 2 sentence1 Emma Woodhouse , handsome , clever , and rich ...
woodhouse 2 sentence1 Emma Woodhouse , handsome , clever , and rich ...
handsome 1 sentence1 Emma Woodhouse , handsome , clever , and rich ...
youngest 1 sentence2 She was the youngest of the two daughters of a...
two 1 sentence2 She was the youngest of the two daughters of a...
daughters 2 sentence2 She was the youngest of the two daughters of a...
I am beginning to think that it is not possible to do what I am looking for... can you help me? Thanks!
CodePudding user response:
Let us explode and expand the TOKEN column into a new dataframe, then join it back with the original dataframe:
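# Explode TOKEN so that each (word, freq) tuple gets its own row
s = df.explode('TOKEN', ignore_index=True)
# Expand the tuples into WORD/FREQ columns, then join the remaining columns back
pd.DataFrame([*s.pop('TOKEN')], columns=['WORD', 'FREQ']).join(s)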
WORD FREQ ID TEXT
0 emma 2 sentence1 Emma Woodhouse , handsome , clever , and rich ...
1 woodhouse 2 sentence1 Emma Woodhouse , handsome , clever , and rich ...
2 handsome 1 sentence1 Emma Woodhouse , handsome , clever , and rich ...
3 youngest 1 sentence2 She was the youngest of the two daughters of a...
4 two 1 sentence2 She was the youngest of the two daughters of a...
5 daughters 2 sentence2 She was the youngest of the two daughters of a...
6 mother 2 sentence3 Her mother had died too long ago for her to ha...
7 died 1 sentence3 Her mother had died too long ago for her to ha...
8 long 1 sentence3 Her mother had died too long ago for her to ha...
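For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes TOKEN holds real Python lists of (word, count) tuples rather than their string representation, and it rebuilds the sample frame from the question with the TEXT values left truncated as they appear there. The names `tokens` and `result` are only for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question; TEXT values stay truncated as shown there.
# Assumption: TOKEN contains actual lists of tuples, not strings.
df = pd.DataFrame({
    "ID": ["sentence1", "sentence2", "sentence3"],
    "TEXT": [
        "Emma Woodhouse , handsome , clever , and rich ...",
        "She was the youngest of the two daughters of a...",
        "Her mother had died too long ago for her to ha...",
    ],
    "TOKEN": [
        [("emma", 2), ("woodhouse", 2), ("handsome", 1)],
        [("youngest", 1), ("two", 1), ("daughters", 2)],
        [("mother", 2), ("died", 1), ("long", 1)],
    ],
})

# One row per tuple, with a fresh 0..n-1 index
s = df.explode("TOKEN", ignore_index=True)

# pop() removes TOKEN from s and returns it; each tuple becomes one WORD/FREQ row
tokens = pd.DataFrame(s.pop("TOKEN").tolist(), columns=["WORD", "FREQ"])

# Both frames share the same 0..n-1 index, so a plain index join lines them up
result = tokens.join(s)
print(result)
```

Because `ignore_index=True` gives the exploded frame a clean index, the join needs no key column.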
CodePudding user response:
You can explode the TOKEN column, turn the resulting tuples into a dataframe with the desired column names, and then concatenate it column-wise with the rest of the original dataframe:
pd.concat(
    [df.TOKEN.explode().transform(pd.Series)
       .rename(columns={0: 'WORD', 1: 'FREQ'}),
     df.drop(columns="TOKEN")],
    axis=1)
OUTPUT
WORD FREQ ID TEXT
0 emma 2 sentence1 Emma Woodhouse , handsome , clever ,...
0 woodhouse 2 sentence1 Emma Woodhouse , handsome , clever ,...
0 handsome 1 sentence1 Emma Woodhouse , handsome , clever ,...
1 youngest 1 sentence2 She was the youngest of the two daug...
1 two 1 sentence2 She was the youngest of the two daug...
1 daughters 2 sentence2 She was the youngest of the two daug...
2 mother 2 sentence3 Her mother had died too long ago for...
2 died 1 sentence3 Her mother had died too long ago for...
2 long 1 sentence3 Her mother had died too long ago for...
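Note that the index repeats the original row labels (0, 0, 0, 1, 1, 1, ...). You can reset it at the end with .reset_index(drop=True) if you need a unique index.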
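If a fresh 0..n-1 index is wanted, a variant of this approach can be sketched as below. This is not the exact code above: it builds the WORD/FREQ frame directly from the exploded tuples and uses an index join instead of `pd.concat`, and it again assumes TOKEN holds lists of tuples. The names `exploded`, `tokens`, and `result` are illustrative.

```python
import pandas as pd

# Two rows from the question are enough to show the index handling
df = pd.DataFrame({
    "ID": ["sentence1", "sentence2"],
    "TEXT": [
        "Emma Woodhouse , handsome , clever , and rich ...",
        "She was the youngest of the two daughters of a...",
    ],
    "TOKEN": [
        [("emma", 2), ("woodhouse", 2), ("handsome", 1)],
        [("youngest", 1), ("two", 1), ("daughters", 2)],
    ],
})

# Without ignore_index, explode keeps the source labels: 0, 0, 0, 1, 1, 1
exploded = df["TOKEN"].explode()

# Keep those repeated labels so the join can match each token to its source row
tokens = pd.DataFrame(exploded.tolist(), index=exploded.index,
                      columns=["WORD", "FREQ"])

# Index join against the remaining columns, then replace the repeated labels
result = tokens.join(df.drop(columns="TOKEN")).reset_index(drop=True)
print(result)
```

Resetting the index last matters here, since the repeated labels are what tie each token back to its ID and TEXT.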