I need to create new column (Unrated Duplicate ?) with True / False values. If subset (Start, End, Qty, Sig, Rate) is exact duplicate for each unique date range (Start, End), then assign True to rows where specific value (Charged ? == 'N').
Original df:
Account Start End Qty Sig Rate Charged?
1234567890 2021-08-18 2021-09-17 1174 A ABC N
1234567890 2021-08-18 2021-09-17 0 B ABC N
1234567890 2021-08-18 2021-09-17 0 C ABC N
1234567890 2021-08-18 2021-09-17 224 D ABC N
1234567890 2021-08-18 2021-09-17 1398 TOT ABC N
1234567890 2021-08-18 2021-09-17 2024 A ABC Y
1234567890 2021-08-18 2021-09-17 0 B ABC Y
1234567890 2021-08-18 2021-09-17 0 C ABC Y
1234567890 2021-08-18 2021-09-17 414 D ABC Y
1234567890 2021-08-18 2021-09-17 2438 TOT ABC Y
1234567890 2021-09-17 2021-10-16 548 A ABC N
1234567890 2021-09-17 2021-10-16 586 B ABC N
1234567890 2021-09-17 2021-10-16 113 C ABC N
1234567890 2021-09-17 2021-10-16 105 D ABC N
1234567890 2021-09-17 2021-10-16 1352 TOT ABC N
1234567890 2021-09-17 2021-10-16 548 A ABC Y
1234567890 2021-09-17 2021-10-16 586 B ABC Y
1234567890 2021-09-17 2021-10-16 113 C ABC Y
1234567890 2021-09-17 2021-10-16 105 D ABC Y
1234567890 2021-09-17 2021-10-16 1352 TOT ABC Y
output should be:
Account Start End Qty Sig Rate Charged? Unr Dupl ?
1234567890 2021-08-18 2021-09-17 1174 A ABC N False
1234567890 2021-08-18 2021-09-17 0 B ABC N False
1234567890 2021-08-18 2021-09-17 0 C ABC N False
1234567890 2021-08-18 2021-09-17 224 D ABC N False
1234567890 2021-08-18 2021-09-17 1398 TOT ABC N False
1234567890 2021-08-18 2021-09-17 2024 A ABC Y False
1234567890 2021-08-18 2021-09-17 0 B ABC Y False
1234567890 2021-08-18 2021-09-17 0 C ABC Y False
1234567890 2021-08-18 2021-09-17 414 D ABC Y False
1234567890 2021-08-18 2021-09-17 2438 TOT ABC Y False
1234567890 2021-09-17 2021-10-16 548 A ABC N True
1234567890 2021-09-17 2021-10-16 586 B ABC N True
1234567890 2021-09-17 2021-10-16 113 C ABC N True
1234567890 2021-09-17 2021-10-16 105 D ABC N True
1234567890 2021-09-17 2021-10-16 1352 TOT ABC N True
1234567890 2021-09-17 2021-10-16 548 A ABC Y False
1234567890 2021-09-17 2021-10-16 586 B ABC Y False
1234567890 2021-09-17 2021-10-16 113 C ABC Y False
1234567890 2021-09-17 2021-10-16 105 D ABC Y False
1234567890 2021-09-17 2021-10-16 1352 TOT ABC Y False
My code, but it doesn't work the way I want:
duplicate_read = []
for i in df["Account"].unique():
for ind in df[df["Account"] == i].index:
duplicate_read.append(df[df["Account"] == i].duplicated(subset = ["Start", "End", "Qty", "Sig", "Rate"], keep = 'last')[ind])
df["Unrated Duplicate ?"] = duplicate_read
df
Thank you.
CodePudding user response:
Try:
df["Duplicate"] = df.duplicated(["Start", "End", "Qty", "Sig", "Rate"], keep=False)
df["Unr Dupl?"] = df.groupby(["Start","End"])["Duplicate"].transform("all")&df["Charged?"].eq("N")
df = df.drop("Duplicate", axis=1)
>>> df
Account Start End Qty Sig Rate Charged? Unr Dupl?
0 1234567890 2021-08-18 2021-09-17 1174 A ABC N False
1 1234567890 2021-08-18 2021-09-17 0 B ABC N False
2 1234567890 2021-08-18 2021-09-17 0 C ABC N False
3 1234567890 2021-08-18 2021-09-17 224 D ABC N False
4 1234567890 2021-08-18 2021-09-17 1398 TOT ABC N False
5 1234567890 2021-08-18 2021-09-17 2024 A ABC Y False
6 1234567890 2021-08-18 2021-09-17 0 B ABC Y False
7 1234567890 2021-08-18 2021-09-17 0 C ABC Y False
8 1234567890 2021-08-18 2021-09-17 414 D ABC Y False
9 1234567890 2021-08-18 2021-09-17 2438 TOT ABC Y False
10 1234567890 2021-09-17 2021-10-16 548 A ABC N True
11 1234567890 2021-09-17 2021-10-16 586 B ABC N True
12 1234567890 2021-09-17 2021-10-16 113 C ABC N True
13 1234567890 2021-09-17 2021-10-16 105 D ABC N True
14 1234567890 2021-09-17 2021-10-16 1352 TOT ABC N True
15 1234567890 2021-09-17 2021-10-16 548 A ABC Y False
16 1234567890 2021-09-17 2021-10-16 586 B ABC Y False
17 1234567890 2021-09-17 2021-10-16 113 C ABC Y False
18 1234567890 2021-09-17 2021-10-16 105 D ABC Y False
19 1234567890 2021-09-17 2021-10-16 1352 TOT ABC Y False