Group by dataframe using python and pandas-CodePudding

Lets say i have df like this

ID	name_x	st	string
1	xx	us	Being unacquainted with the chief raccoon was harming his prospects for promotion
2	xy	us1	The overpass went under the highway and into a secret world
3	xz	us	He was 100% into fasting with her until he understood that meant he couldn't eat
4	xu	us2	Random words in front of other random words create a random sentence
5	xi	us1	All you need to do is pick up the pen and begin

Using python and pandas for column st I want count name_x values and then extract top 3 key words from string.

For example like this:

st	name_x_count	top1_word	top2_word	top3_word
us	2	word1	word2	word3
us1	2	word1	word2	word3
us2	1	word1	word2	word3

Is there any way to solve this task?

CodePudding user response：

I would first groupby() to concatenate the strings as you show and then use collections Counter and then most_common. Finally assign it back to the dataframe. I am using x.lower() because otherwise "He" and "he" will be considered a different word (but you can always remove it if this is intended):

output = df.groupby('st').agg(
    name_x_count = pd.NamedAgg('name_x','count'),
    string = pd.NamedAgg('string',' '.join))

After grouping by we create the columns by using collections.Counter():

output[['top1_word','top2_word','top3_word']] = output['string'].map(lambda x: [x[0] for x in collections.Counter(x.lower().split()).most_common(3)])
output = output.drop(columns='string')

Output:

     name_x_count top1_word top2_word top3_word
st                                             
us              2        he      with       was
us1             2       the       and  overpass
us2             1    random     words        in

CodePudding user response：

First, I added a space at the end of each string, as we will combine sentences while grouping. Then I consolidated the sentences after grouping by the st column.

df['string']=df['string']   ' ' # we will use sum function. When combining sentences, there should be spaces in between.

dfx=df.groupby('st').agg({'st':'count','string':'sum'}) #groupby st and combine strings

Then list each word of the string expression, calculate their distribution and get the first 3 values.

from collections import Counter

mask=dfx['string'].apply(lambda x: list(dict(Counter(x.split()).most_common()[:3]).keys()))
print(mask)

'''
st  string
us  ['with', 'was', 'he']
us1 ['the', 'and', 'The']
us2 ['words', 'random', 'Random']


'''

Finally, add these first 3 words as new columns.

dfx[['top1_word','top2_word','top3_word']]=pd.DataFrame(mask.tolist(), index= mask.index)

dfx

st  name_x_count    top1_word   top2_word   top3_word
us  2               with        was         he
us1 2               the         and         The
us2 1               words       random      Random