How to add a new column of aggregated data
I want to create 3 new columns in a dataframe:
Column 01: unique_list
A new column with the list of unique cfop_code values for each key.
Column 02: unique_count
A column with the number of unique values in unique_list.
Column 03: not_unique_count
A column with the total number of cfop_code values (including duplicates) for each key.
example_df
| | key | product | cfop_code |
|---|---|---|---|
| 0 | 12345678901234567890 | product a | 2551 |
| 1 | 12345678901234567890 | product b | 2551 |
| 2 | 12345678901234567890 | product c | 3551 |
| 3 | 12345678901234567895 | product a | 2551 |
| 4 | 12345678901234567897 | product b | 2551 |
| 5 | 12345678901234567897 | product c | 2407 |
Expected Result
| | key | product | cfop_code | unique_list | unique_count | not_unique_count |
|---|---|---|---|---|---|---|
| 0 | 12345678901234567890 | product a | 2551 | 2551, 3551 | 2 | 3 |
| 1 | 12345678901234567890 | product b | 2551 | 2551, 3551 | 2 | 3 |
| 2 | 12345678901234567890 | product c | 3551 | 2551, 3551 | 2 | 3 |
| 3 | 12345678901234567895 | product a | 2551 | 2551 | 1 | 1 |
| 4 | 12345678901234567897 | product b | 2551 | 2407, 2551 | 2 | 2 |
| 5 | 12345678901234567897 | product c | 2407 | 2407, 2551 | 2 | 2 |
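For reference, the example dataframe above can be reconstructed with a small snippet (the keys are stored as strings here by assumption, since 20-digit values exceed the int64 range):

```python
import pandas as pd

# Reconstruction of example_df; keys kept as strings
# (an assumption -- 20-digit numbers overflow int64)
df = pd.DataFrame(
    {
        "key": ["12345678901234567890"] * 3
        + ["12345678901234567895"]
        + ["12345678901234567897"] * 2,
        "product": ["product a", "product b", "product c",
                    "product a", "product b", "product c"],
        "cfop_code": [2551, 2551, 3551, 2551, 2551, 2407],
    }
)
print(df)
```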
What I have tried
Creating a list of unique values:
df.groupby('key')["cfop_code"].unique()
key
12345678901234567890    [2551, 3551]
12345678901234567895          [2551]
12345678901234567897    [2551, 2407]
Name: cfop_code, dtype: object
Getting the count of not-unique values:
df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="size")}).reset_index()
    key                   unique_values
0   12345678901234567890  3
1   12345678901234567895  1
2   12345678901234567897  2
Getting the count of unique values into a dataframe:
df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="nunique")}).reset_index()
    key                   unique_values
0   12345678901234567890  2
1   12345678901234567895  1
2   12345678901234567897  2
But it FAILS when adding the new columns:
df['unique_list'] = df.groupby('key')["cfop_code"].unique()
df['unique_count'] = df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="nunique")}).reset_index()
df['not_unique_count'] = df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="size")}).reset_index()
CodePudding user response:
Try:
tmp = (
    df.groupby("key")["cfop_code"]
    .agg(
        unique_list=lambda s: sorted(s.unique()),  # sorted list of unique codes
        unique_count="nunique",                    # number of unique codes
        not_unique_count="size",                   # total rows per key
    )
    .reset_index()
)
res = df.merge(tmp, on="key")
print(res)
key product cfop_code unique_list unique_count not_unique_count
0 12345678901234567890 product a 2551 [2551, 3551] 2 3
1 12345678901234567890 product b 2551 [2551, 3551] 2 3
2 12345678901234567890 product c 3551 [2551, 3551] 2 3
3 12345678901234567895 product a 2551 [2551] 1 1
4 12345678901234567897 product b 2551 [2407, 2551] 2 2
5 12345678901234567897 product c 2407 [2407, 2551] 2 2
The problem with your attempt is that:
df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="nunique")}).reset_index()
returns a DataFrame. You try to assign this whole DataFrame to a single new column, which fails.
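If you only need the counts, an alternative to the merge is groupby().transform, which returns a Series aligned with the original index and can therefore be assigned directly; the list column can be filled with map. A sketch (with hypothetical stand-in keys, same column names as in the question):

```python
import pandas as pd

# Minimal stand-in data (hypothetical keys)
df = pd.DataFrame(
    {
        "key": ["A", "A", "A", "B"],
        "cfop_code": [2551, 2551, 3551, 2551],
    }
)

g = df.groupby("key")["cfop_code"]

# transform() returns a Series aligned with df's index,
# so plain column assignment works
df["unique_count"] = g.transform("nunique")
df["not_unique_count"] = g.transform("size")

# unique() aggregates to one array per key; map it back by key
df["unique_list"] = df["key"].map(g.unique())
print(df)
```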
CodePudding user response:
You can merge after groupby and agg, like this:
df.merge(df.groupby('key', as_index=False).agg(
    unique_list=('cfop_code', 'unique'),
    unique_count=('cfop_code', 'nunique'),
    not_unique_count=('cfop_code', 'size')
), on='key', how='left')
output:
key product cfop_code unique_list unique_count \
0 12345678901234567890 product a 2551 [2551, 3551] 2
1 12345678901234567890 product b 2551 [2551, 3551] 2
2 12345678901234567890 product c 3551 [2551, 3551] 2
3 12345678901234567895 product a 2551 [2551] 1
4 12345678901234567897 product b 2551 [2551, 2407] 2
5 12345678901234567897 product c 2407 [2551, 2407] 2
not_unique_count
0 3
1 3
2 3
3 1
4 2
5 2
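Note that the two answers differ slightly in unique_list: the 'unique' aggregation keeps values in order of first appearance ([2551, 2407] for the last key here), while the first answer sorts them. If a deterministic order matters, sorting inside the aggregation is a small tweak (a sketch with stand-in data):

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "A", "A"], "cfop_code": [2551, 2407, 2551]})

agg = df.groupby("key", as_index=False).agg(
    # sorted() makes the list independent of row order
    unique_list=("cfop_code", lambda s: sorted(s.unique())),
)
print(agg)
```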