Python/Pandas: conditionally collapse rows, keep most common value, and sum counts-CodePudding

I have a large pandas dataframe with common sentences 'S', which I'd like to learn, and their counts 'C', like this:

df = pd.DataFrame({"S": ["Yes.", "Yes!", "Yes?",
                       "No?", "No.", "What?"],
                 "C": [100, 50, 40, 30, 10, 5]})
df
       S    C
0   Yes.  100
1   Yes!   50
2   Yes?   40
3    No?   30
4    No.   10
5  What?    5

If sentences are the same up to their last character, I'd like to collapse them into one, represented by the most common sentence, while adding up the counts across the group. The resulting dataframe should look like this:

       S    C
0   Yes.  190
1    No?   40
2  What?    5

In the same way, I have list of common words - some lowercase, some uppercase - and I'd like to add the counts across the lower/uppercase group while retaining only the most common case.

CodePudding user response：

First create a column for 'S' until the last character in the string

df['s'] = df['S'].str[:-1]

Then you can use groupby() to sum up 'C' and get the most common occurrence in 'S'. Example

most_common = df[df['C'] == df.groupby('s')['C'].transform('max')][['S', 's']]
total = df.groupby('s')['C'].sum().reset_index()
most_common.merge(total).drop(columns=['s']).sort_values(by='C', ascending=False)

Same for uppercase/lowercase using str.lower()

CodePudding user response：

here is one way to do it

# using replace, remove the punctuation characters with empty string
# then groupby with sum

(df.assign(text=df['S'].str.replace('[\.|!|\?|\s]', '', regex=True))
 .groupby('text', as_index=False)
 .sum()
 .rename(columns={'text':'S'})
)

    text     C
0   No       40
1   What      5
2   Yes     190