I have the following group by in pandas
df = df.groupby("pred_text1").agg({
"rouge_score":"first","pred_text1":"first",
"input_text":"first", "target_text":"first", "prefix":"first"})
I want to have another columns which counts the number in each group (repetition of each pred_text
)
I know that using transform('count')
after groupby
can add such a new column, but I still need agg
function too.
CodePudding user response:
You can use pd.NamedAgg
function.
Code:
import pandas as pd
# A sample dataframe
df = pd.DataFrame({
'pred_text1': [chr(ord('A') i%3) for i in range(10)],
'rouge_score': [i % 4 for i in range(10)],
'input_text': [i % 6 for i in range(10)],
'target_text': [i % 7 for i in range(10)],
'prefix': [i % 8 for i in range(10)],
})
# Aggregation
df = df.groupby("pred_text1").agg(
rouge_score=pd.NamedAgg(column="rouge_score", aggfunc="first"),
pred_text1=pd.NamedAgg(column="pred_text1", aggfunc="first"),
input_text=pd.NamedAgg(column="input_text", aggfunc="first"),
target_text=pd.NamedAgg(column="target_text", aggfunc="first"),
prefix=pd.NamedAgg(column="prefix", aggfunc="first"),
pred_count=pd.NamedAgg(column="pred_text1", aggfunc="count"),
)
Input:
(index) | pred_text1 | rouge_score | input_text | target_text | prefix |
---|---|---|---|---|---|
0 | A | 0 | 0 | 0 | 0 |
1 | B | 1 | 1 | 1 | 1 |
2 | C | 2 | 2 | 2 | 2 |
3 | A | 3 | 3 | 3 | 3 |
4 | B | 0 | 4 | 4 | 4 |
5 | C | 1 | 5 | 5 | 5 |
6 | A | 2 | 0 | 6 | 6 |
7 | B | 3 | 1 | 0 | 7 |
8 | C | 0 | 2 | 1 | 0 |
9 | A | 1 | 3 | 2 | 1 |
Output:
(pred_text1) | rouge_score | pred_text1 | input_text | target_text | prefix | pred_count |
---|---|---|---|---|---|---|
A | 0 | A | 0 | 0 | 0 | 4 |
B | 1 | B | 1 | 1 | 1 | 3 |
C | 2 | C | 2 | 2 | 2 | 3 |
CodePudding user response:
import pandas as pd
df = pd.DataFrame({'Group': ['G1', 'G1', 'G2', 'G3', 'G2', 'G1'], 'ValueLabel': [0, 1, 1, 1, 0, 2]})
You can do this in a few steps: First aggregate.
df = df.groupby('Group').agg({'ValueLabel': ['first', 'count']})
The new columns are a pd.MultiIndex
type which we will flatten. After that we can create a mapping of the names to the labels we want and rename the columns.
df.columns = df.columns.to_flat_index()
mapper = {label: new_label for label, new_label in zip \
(pd.MultiIndex.from_product([['ValueLabel'], \
['first', 'count']]), ['ValueFirstLabel', 'ValueCountLabel'])}
df.rename(mapper, axis=1, inplace=True)