How to convert each element in a frequency column into a new dataframe row?

Time:07-11

I have a dataframe with an ID column, a TEXT column, and a TOKEN column that holds the 3 most frequent words in TEXT as (word, count) tuples:

ID           TEXT                                               TOKEN
sentence1    Emma Woodhouse , handsome , clever , and rich ...  [(emma, 2), (woodhouse, 2), (handsome, 1)]
sentence2    She was the youngest of the two daughters of a...  [(youngest, 1), (two, 1), (daughters, 2)]
sentence3    Her mother had died too long ago for her to ha...  [(mother, 2), (died, 1), (long, 1)]

I want to turn each element of the TOKEN column into its own row in a new dataframe. I have tried several approaches, but I cannot get the token elements out of their column. The expected output would look like this:

WORD         FREQ    ID           TEXT                                
emma         2       sentence1    Emma Woodhouse , handsome , clever , and rich ...  
woodhouse    2       sentence1    Emma Woodhouse , handsome , clever , and rich ...  
handsome     1       sentence1    Emma Woodhouse , handsome , clever , and rich ...  
youngest     1       sentence2    She was the youngest of the two daughters of a... 
two          1       sentence2    She was the youngest of the two daughters of a...
daughters    2       sentence2    She was the youngest of the two daughters of a...

I am beginning to think that it is not possible to do what I am looking for... can you help me? Thanks!

CodePudding user response:

Explode the TOKEN column, expand the resulting tuples into a new dataframe, then join it back with the remaining columns of the original dataframe:

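# One row per (word, freq) tuple; ignore_index=True gives a fresh 0..n-1 index
s = df.explode('TOKEN', ignore_index=True)
# Expand the tuples into WORD/FREQ columns, then join the remaining columns by index
pd.DataFrame([*s.pop('TOKEN')], columns=['WORD', 'FREQ']).join(s)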

        WORD  FREQ         ID                                               TEXT
0       emma     2  sentence1  Emma Woodhouse , handsome , clever , and rich ...
1  woodhouse     2  sentence1  Emma Woodhouse , handsome , clever , and rich ...
2   handsome     1  sentence1  Emma Woodhouse , handsome , clever , and rich ...
3   youngest     1  sentence2  She was the youngest of the two daughters of a...
4        two     1  sentence2  She was the youngest of the two daughters of a...
5  daughters     2  sentence2  She was the youngest of the two daughters of a...
6     mother     2  sentence3  Her mother had died too long ago for her to ha...
7       died     1  sentence3  Her mother had died too long ago for her to ha...
8       long     1  sentence3  Her mother had died too long ago for her to ha...
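For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same explode-then-join idea. The sample dataframe is rebuilt by hand from the question (with the TEXT values truncated exactly as shown there), and the variable names `tokens` and `result` are only illustrative. It assumes pandas 1.1 or later, where `explode` accepts `ignore_index`.

```python
import pandas as pd

# Sample data rebuilt from the question; TEXT is truncated as in the post.
df = pd.DataFrame({
    "ID": ["sentence1", "sentence2", "sentence3"],
    "TEXT": [
        "Emma Woodhouse , handsome , clever , and rich ...",
        "She was the youngest of the two daughters of a...",
        "Her mother had died too long ago for her to ha...",
    ],
    "TOKEN": [
        [("emma", 2), ("woodhouse", 2), ("handsome", 1)],
        [("youngest", 1), ("two", 1), ("daughters", 2)],
        [("mother", 2), ("died", 1), ("long", 1)],
    ],
})

# One row per tuple, with a fresh 0..n-1 index.
s = df.explode("TOKEN", ignore_index=True)

# Remove TOKEN from s and expand its tuples into two named columns.
tokens = pd.DataFrame(s.pop("TOKEN").tolist(), columns=["WORD", "FREQ"])

# Both frames share the same 0..n-1 index, so a plain join lines them up.
result = tokens.join(s)
print(result)
```

Because `ignore_index=True` renumbers the exploded rows, the join is a simple one-to-one match on the index and no `reset_index` is needed afterwards.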

CodePudding user response:

Explode the TOKEN column, expand the tuples into a dataframe with the desired column names, and concatenate it column-wise with the rest of the original dataframe:

pd.concat(
    [df.TOKEN.explode().transform(pd.Series)
       .rename(columns={0: 'WORD', 1: 'FREQ'}),
     df.drop(columns="TOKEN")],
    axis=1)

OUTPUT

        WORD  FREQ         ID                                     TEXT
0       emma     2  sentence1  Emma Woodhouse , handsome , clever ,...
0  woodhouse     2  sentence1  Emma Woodhouse , handsome , clever ,...
0   handsome     1  sentence1  Emma Woodhouse , handsome , clever ,...
1   youngest     1  sentence2  She was the youngest of the two daug...
1        two     1  sentence2  She was the youngest of the two daug...
1  daughters     2  sentence2  She was the youngest of the two daug...
2     mother     2  sentence3  Her mother had died too long ago for...
2       died     1  sentence3  Her mother had died too long ago for...
2       long     1  sentence3  Her mother had died too long ago for...

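The exploded rows keep the index labels of their source rows (0, 0, 0, 1, 1, 1, ...), so you can reset the index at the end if you need unique labels.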
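As a rough illustration of that last step, the sketch below builds a small frame with the same repeated index labels and then renumbers it with `reset_index(drop=True)`. It is a simplified stand-in rather than the exact `pd.concat` expression above: the two-row sample data, the list-comprehension column assignment, and the names `out` and `result` are assumptions made for the example.

```python
import pandas as pd

# Two-row sample in the shape of the question's data (TEXT omitted for brevity).
df = pd.DataFrame({
    "ID": ["sentence1", "sentence2"],
    "TOKEN": [
        [("emma", 2), ("woodhouse", 2), ("handsome", 1)],
        [("youngest", 1), ("two", 1), ("daughters", 2)],
    ],
})

# Without ignore_index, explode repeats the source row's index label.
exploded = df.explode("TOKEN")

# Build WORD/FREQ from plain lists so no index alignment is involved.
out = exploded.drop(columns="TOKEN")
out["WORD"] = [word for word, _ in exploded["TOKEN"]]
out["FREQ"] = [freq for _, freq in exploded["TOKEN"]]
print(out.index.tolist())     # repeated labels

# drop=True discards the old labels instead of keeping them as a column.
result = out.reset_index(drop=True)
print(result.index.tolist())  # unique labels
```

Passing `drop=True` matters here: without it, the old repeated labels would be kept as an extra `index` column in the result.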
