i have a problem that i cannot reach a solution, this is my dataset:
data = [[2022-06-01, 'John', 'A'], [2022-06-02, 'John', 'D'],
[2022-06-03, 'John', 'D'], [2022-06-01, 'Sara', 'D'],
[2022-06-02, 'Sara', 'D'], [2022-06-01, 'Lucas', 'A'],
[2022-06-02, 'Lucas', 'A'], [2022-06-01, 'Mike', 'D'],
[2022-06-02, 'Mike', 'A'], [2022-06-03, 'Mike', 'D']]
df = pd.DataFrame(data, columns=['Date', 'Name', 'Label'])
I just want to choose rows with D label that don't have an 'A' before the 'D' in each name, my real problem probably has more than 70 rows with each name and i need all the 'D' that match that condition, for example, the solution of this problem is:
Date Name Label
2022-06-01 Sara D
2022-06-02 Sara D
2022-06-01 Mike D
I need help because i didn't find any good aproach to see how i deal the problem, thank you for your help!
CodePudding user response:
Use:
mask = df.assign(indicator=df["Label"].eq("A")).groupby("Name")["indicator"].transform("cumsum").lt(1)
res = df[mask & df["Label"].eq("D")]
print(res)
Output
Date Name Label
3 2022-06-01 Sara D
4 2022-06-02 Sara D
7 2022-06-01 Mike D
Step-by-Step
# is True when the values are A
indicator = df["Label"].eq("A")
# create new DataFrame with this new indicator column
temp = df.assign(indicator=indicator)
# group by name in the new DataFrame and find the cumulative sum of this indicator column
# notice that all D that come before an A will have 0
cumsum = temp.groupby("Name")["indicator"].transform("cumsum")
# create the final mask
mask = cumsum.lt(1) & df["Label"].eq("D")
res = df[mask]
print(res)
CodePudding user response:
Use:
#df = pd.DataFrame(data)
#df.columns=['Date', 'Name', 'Label']
df['date'] = pd.to_datetime(df['Date']) # if no datetime type
df2 = df.sort_values('Date') #if no sorted
new_df = df.loc[df2['Label'].eq('D') &
df2['Label'].ne('A').groupby(df2['Name'], sort=False).cumprod()
]
print(new_df)
Date name Label
3 2022-06-01 Sara D
4 2022-06-02 Sara D
7 2022-06-01 Mike D