Home > Mobile >  groupby() with condition - count / mean
groupby() with condition - count / mean

Time:10-28

im having the dataframe:

test = pd.DataFrame({'Date': [2020 - 12 - 30, 2020 - 12 - 30, 2020 - 12 - 30, 2020 - 12 - 31, 2020 - 12 - 31, 2021 - 0o1 - 0o1, 2021 - 0o1 - 0o1], 'label': ['Positive', 'Positive', 'Negative', 'Negative','Negative', 'Positive', 'Positive'], 'score': [70, 80, 50, 50, 30, 90, 70]})

Output:

   Date         label       score
2020-12-30      Positive    70
2020-12-30      Positive    80
2020-12-30      Negative    50
2020-12-31      Negative    50
2020-12-31      Negative    30
2021-01-01      Positive    90
2021-01-01      Positive    70

My goal is to group by the date and count the labels. In addition the score should be calculating only the mean with the labels/score which are higher at that day. For example if there is more positive than negative at the day it should calculate the mean of the positive scores without the negative scores and the other way around.

I though about the function new_df = test.groupby(['Date', 'label']).agg({'label' : 'count', 'score' : 'mean'})

Output should be like that. Maybe .pivot function would help?

   Date          label     new_score   count_pos  count_neg 
2020-12-30      Positive      150         2          1        
2020-12-31      Negative      80          0          2       
2021-01-01      Positive      160         2          0


new_score = 70   80 = 150 of the two days with positive label
count_pos = count "Positive" at day X
count_neg = count "Negative" at day X

  

Im a beginner in python and any help or hints how to tackle this task is appreciated!

Thanks!

CodePudding user response:

Try:

x = df.pivot_table(
    index=["Date", "label"],
    columns="label",
    values="score",
    aggfunc=["sum", "count"],
    fill_value=0,
)

x.columns = ["_".join(c) for c in x.columns]
x["new_score"] = x[["sum_Negative", "sum_Positive"]].sum(axis=1)
x = x.drop(columns=["sum_Negative", "sum_Positive"])
x = x.reset_index()

print(x)

Prints:

         Date     label  count_Negative  count_Positive  new_score
0  2020-12-30  Negative               1               0         50
1  2020-12-30  Positive               0               2        150
2  2020-12-31  Negative               2               0         80
3  2021-01-01  Positive               0               2        160

CodePudding user response:

Use can use groupby followed by pivot; but to filter with the labels/score which are higher at that day, there is some extra processing required:

Full working example:

df = df.groupby(["Date", "label"]).agg(label_count=("label", "count"), score_sum=("score", "sum")).reset_index()

df = df.pivot(index="Date", columns="label").fillna(0)


def get_labels_by_max(row):
  _df = row.to_frame().transpose()
  _sr = _df["score_sum"][_df["label_count"].idxmax(axis=1).values[0]]
  _df_label_count = _df["label_count"].values[0]
  return [_sr.name, int(_sr.values[0]), int(_df_label_count[1]), int(_df_label_count[0])]
# 
df = df.apply(lambda row: get_labels_by_max(row), axis=1, result_type="expand").reset_index().rename(columns={0:"label", 1:"new_score", 2:"count_pos", 3:"count_neg"})

[Out]:
         Date     label  new_score  count_pos  count_neg
0  2020-12-30  Positive        150          2          1
1  2020-12-31  Negative         80          0          2
2  2021-01-01  Positive        160          2          0
  • Related