Group by using agg and count-CodePudding

I have the following group by in pandas

   df = df.groupby("pred_text1").agg({
        "rouge_score":"first","pred_text1":"first", 
        "input_text":"first", "target_text":"first", "prefix":"first"})

I want to have another columns which counts the number in each group (repetition of each pred_text)

I know that using transform('count') after groupby can add such a new column, but I still need agg function too.

CodePudding user response：

You can use pd.NamedAgg function.

Code:

import pandas as pd

# A sample dataframe
df = pd.DataFrame({
        'pred_text1': [chr(ord('A') i%3) for i in range(10)],
        'rouge_score': [i % 4 for i in range(10)],
        'input_text': [i % 6 for i in range(10)],
        'target_text': [i % 7 for i in range(10)],
        'prefix': [i % 8 for i in range(10)],
    })

# Aggregation
df = df.groupby("pred_text1").agg(
        rouge_score=pd.NamedAgg(column="rouge_score", aggfunc="first"),
        pred_text1=pd.NamedAgg(column="pred_text1", aggfunc="first"),
        input_text=pd.NamedAgg(column="input_text", aggfunc="first"),
        target_text=pd.NamedAgg(column="target_text", aggfunc="first"),
        prefix=pd.NamedAgg(column="prefix", aggfunc="first"),
        pred_count=pd.NamedAgg(column="pred_text1", aggfunc="count"),
    )

Input:

(index)	pred_text1	rouge_score	input_text	target_text	prefix
0	A	0	0	0	0
1	B	1	1	1	1
2	C	2	2	2	2
3	A	3	3	3	3
4	B	0	4	4	4
5	C	1	5	5	5
6	A	2	0	6	6
7	B	3	1	0	7
8	C	0	2	1	0
9	A	1	3	2	1

Output:

(pred_text1)	rouge_score	pred_text1	input_text	target_text	prefix	pred_count
A	0	A	0	0	0	4
B	1	B	1	1	1	3
C	2	C	2	2	2	3

CodePudding user response：

import pandas as pd
df = pd.DataFrame({'Group': ['G1', 'G1', 'G2', 'G3', 'G2', 'G1'], 'ValueLabel': [0, 1, 1, 1, 0, 2]})

You can do this in a few steps: First aggregate.

df = df.groupby('Group').agg({'ValueLabel': ['first', 'count']})

The new columns are a pd.MultiIndex type which we will flatten. After that we can create a mapping of the names to the labels we want and rename the columns.

df.columns = df.columns.to_flat_index()
mapper = {label: new_label for label, new_label in zip \
         (pd.MultiIndex.from_product([['ValueLabel'], \
         ['first', 'count']]), ['ValueFirstLabel', 'ValueCountLabel'])}
df.rename(mapper, axis=1, inplace=True)

(index)	pred_text1	rouge_score	input_text	target_text	prefix
0	A	0	0	0	0
1	B	1	1	1	1
2	C	2	2	2	2
3	A	3	3	3	3
4	B	0	4	4	4
5	C	1	5	5	5
6	A	2	0	6	6
7	B	3	1	0	7
8	C	0	2	1	0
9	A	1	3	2	1

(index)	pred_text1	rouge_score	input_text	target_text	prefix
0	A	0	0	0	0
1	B	1	1	1	1
2	C	2	2	2	2
3	A	3	3	3	3
4	B	0	4	4	4
5	C	1	5	5	5
6	A	2	0	6	6
7	B	3	1	0	7
8	C	0	2	1	0
9	A	1	3	2	1

(index)	pred_text1	rouge_score	input_text	target_text	prefix
0	A	0	0	0	0
1	B	1	1	1	1
2	C	2	2	2	2
3	A	3	3	3	3
4	B	0	4	4	4
5	C	1	5	5	5
6	A	2	0	6	6
7	B	3	1	0	7
8	C	0	2	1	0
9	A	1	3	2	1