Home > Net >  Group by dataframe using python and pandas
Group by dataframe using python and pandas

Time:11-10

Lets say i have df like this

ID name_x st string
1 xx us Being unacquainted with the chief raccoon was harming his prospects for promotion
2 xy us1 The overpass went under the highway and into a secret world
3 xz us He was 100% into fasting with her until he understood that meant he couldn't eat
4 xu us2 Random words in front of other random words create a random sentence
5 xi us1 All you need to do is pick up the pen and begin

Using python and pandas for column st I want count name_x values and then extract top 3 key words from string.

For example like this:

st name_x_count top1_word top2_word top3_word
us 2 word1 word2 word3
us1 2 word1 word2 word3
us2 1 word1 word2 word3

Is there any way to solve this task?

CodePudding user response:

I would first groupby() to concatenate the strings as you show and then use collections Counter and then most_common. Finally assign it back to the dataframe. I am using x.lower() because otherwise "He" and "he" will be considered a different word (but you can always remove it if this is intended):

output = df.groupby('st').agg(
    name_x_count = pd.NamedAgg('name_x','count'),
    string = pd.NamedAgg('string',' '.join))

After grouping by we create the columns by using collections.Counter():

output[['top1_word','top2_word','top3_word']] = output['string'].map(lambda x: [x[0] for x in collections.Counter(x.lower().split()).most_common(3)])
output = output.drop(columns='string')

Output:

     name_x_count top1_word top2_word top3_word
st                                             
us              2        he      with       was
us1             2       the       and  overpass
us2             1    random     words        in

CodePudding user response:

First, I added a space at the end of each string, as we will combine sentences while grouping. Then I consolidated the sentences after grouping by the st column.

df['string']=df['string']   ' ' # we will use sum function. When combining sentences, there should be spaces in between.

dfx=df.groupby('st').agg({'st':'count','string':'sum'}) #groupby st and combine strings

Then list each word of the string expression, calculate their distribution and get the first 3 values.

from collections import Counter

mask=dfx['string'].apply(lambda x: list(dict(Counter(x.split()).most_common()[:3]).keys()))
print(mask)

'''
st  string
us  ['with', 'was', 'he']
us1 ['the', 'and', 'The']
us2 ['words', 'random', 'Random']


'''

Finally, add these first 3 words as new columns.

dfx[['top1_word','top2_word','top3_word']]=pd.DataFrame(mask.tolist(), index= mask.index)

dfx

st  name_x_count    top1_word   top2_word   top3_word
us  2               with        was         he
us1 2               the         and         The
us2 1               words       random      Random

  • Related