Let's say I have a df like this:
ID | name_x | st | string |
---|---|---|---|
1 | xx | us | Being unacquainted with the chief raccoon was harming his prospects for promotion |
2 | xy | us1 | The overpass went under the highway and into a secret world |
3 | xz | us | He was 100% into fasting with her until he understood that meant he couldn't eat |
4 | xu | us2 | Random words in front of other random words create a random sentence |
5 | xi | us1 | All you need to do is pick up the pen and begin |
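For reference, here is the same data as a DataFrame (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'name_x': ['xx', 'xy', 'xz', 'xu', 'xi'],
    'st': ['us', 'us1', 'us', 'us2', 'us1'],
    'string': [
        "Being unacquainted with the chief raccoon was harming his prospects for promotion",
        "The overpass went under the highway and into a secret world",
        "He was 100% into fasting with her until he understood that meant he couldn't eat",
        "Random words in front of other random words create a random sentence",
        "All you need to do is pick up the pen and begin",
    ],
})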
Using Python and pandas, I want to group by the st column, count the name_x values, and then extract the top 3 keywords from string.
For example, like this:
st | name_x_count | top1_word | top2_word | top3_word |
---|---|---|---|---|
us | 2 | word1 | word2 | word3 |
us1 | 2 | word1 | word2 | word3 |
us2 | 1 | word1 | word2 | word3 |
Is there any way to solve this task?
CodePudding user response:
I would first use groupby() to concatenate the strings as you show, and then use collections.Counter with most_common. Finally, assign the result back to the dataframe. I am using x.lower() because otherwise "He" and "he" would be counted as different words (you can remove it if that is intended):
import collections
import pandas as pd

output = df.groupby('st').agg(
    name_x_count=pd.NamedAgg('name_x', 'count'),
    string=pd.NamedAgg('string', ' '.join))
After grouping, we create the new columns using collections.Counter():
output[['top1_word','top2_word','top3_word']] = output['string'].map(
    lambda s: [word for word, _ in collections.Counter(s.lower().split()).most_common(3)])
output = output.drop(columns='string')  # drop the helper column
Output:
name_x_count top1_word top2_word top3_word
st
us 2 he with was
us1 2 the and overpass
us2 1 random words in
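If you want st as a regular column, as in the example output in the question, rather than as the index, you can reset the index afterwards:
output = output.reset_index()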
CodePudding user response:
First, I added a space at the end of each string, as we will combine sentences while grouping. Then I consolidated the sentences after grouping by the st column.
df['string'] = df['string'] + ' '  # we will use the sum function; when combining sentences, there should be spaces in between
dfx = df.groupby('st').agg({'name_x': 'count', 'string': 'sum'}).rename(columns={'name_x': 'name_x_count'})  # group by st, count name_x and combine the strings
Then split each combined string into words, count how often each word occurs, and take the first 3 values.
from collections import Counter
mask=dfx['string'].apply(lambda x: list(dict(Counter(x.split()).most_common()[:3]).keys()))
print(mask)
'''
st string
us ['with', 'was', 'he']
us1 ['the', 'and', 'The']
us2 ['words', 'random', 'Random']
'''
Finally, add these first 3 words as new columns and drop the helper string column.
dfx[['top1_word','top2_word','top3_word']] = pd.DataFrame(mask.tolist(), index=mask.index)
dfx = dfx.drop(columns='string')
dfx
name_x_count top1_word top2_word top3_word
st
us 2 with was he
us1 2 the and The
us2 1 words random Random
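Note that, unlike the first answer, this keeps "The"/"the" and "Random"/"random" as separate words. If you want them merged, lower-case each string before splitting when building mask, for example:
mask = dfx['string'].apply(lambda x: [w for w, _ in Counter(x.lower().split()).most_common(3)])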