im having the dataframe:
test = pd.DataFrame({'Date': [2020 - 12 - 30, 2020 - 12 - 30, 2020 - 12 - 30, 2020 - 12 - 31, 2020 - 12 - 31, 2021 - 0o1 - 0o1, 2021 - 0o1 - 0o1], 'label': ['Positive', 'Positive', 'Negative', 'Negative','Negative', 'Positive', 'Positive'], 'score': [70, 80, 50, 50, 30, 90, 70]})
Output:
Date label score
2020-12-30 Positive 70
2020-12-30 Positive 80
2020-12-30 Negative 50
2020-12-31 Negative 50
2020-12-31 Negative 30
2021-01-01 Positive 90
2021-01-01 Positive 70
My goal is to group by the date and count the labels. In addition the score should be calculating only the mean with the labels/score which are higher at that day. For example if there is more positive than negative at the day it should calculate the mean of the positive scores without the negative scores and the other way around.
I though about the function new_df = test.groupby(['Date', 'label']).agg({'label' : 'count', 'score' : 'mean'})
Output should be like that. Maybe .pivot function would help?
Date label new_score count_pos count_neg
2020-12-30 Positive 150 2 1
2020-12-31 Negative 80 0 2
2021-01-01 Positive 160 2 0
new_score = 70 80 = 150 of the two days with positive label
count_pos = count "Positive" at day X
count_neg = count "Negative" at day X
Im a beginner in python and any help or hints how to tackle this task is appreciated!
Thanks!
CodePudding user response:
Try:
x = df.pivot_table(
index=["Date", "label"],
columns="label",
values="score",
aggfunc=["sum", "count"],
fill_value=0,
)
x.columns = ["_".join(c) for c in x.columns]
x["new_score"] = x[["sum_Negative", "sum_Positive"]].sum(axis=1)
x = x.drop(columns=["sum_Negative", "sum_Positive"])
x = x.reset_index()
print(x)
Prints:
Date label count_Negative count_Positive new_score
0 2020-12-30 Negative 1 0 50
1 2020-12-30 Positive 0 2 150
2 2020-12-31 Negative 2 0 80
3 2021-01-01 Positive 0 2 160
CodePudding user response:
Use can use groupby followed by pivot; but to filter with the labels/score which are higher at that day, there is some extra processing required:
Full working example:
df = df.groupby(["Date", "label"]).agg(label_count=("label", "count"), score_sum=("score", "sum")).reset_index()
df = df.pivot(index="Date", columns="label").fillna(0)
def get_labels_by_max(row):
_df = row.to_frame().transpose()
_sr = _df["score_sum"][_df["label_count"].idxmax(axis=1).values[0]]
_df_label_count = _df["label_count"].values[0]
return [_sr.name, int(_sr.values[0]), int(_df_label_count[1]), int(_df_label_count[0])]
#
df = df.apply(lambda row: get_labels_by_max(row), axis=1, result_type="expand").reset_index().rename(columns={0:"label", 1:"new_score", 2:"count_pos", 3:"count_neg"})
[Out]:
Date label new_score count_pos count_neg
0 2020-12-30 Positive 150 2 1
1 2020-12-31 Negative 80 0 2
2 2021-01-01 Positive 160 2 0