Home > OS >  filter rows that match a previous condition
filter rows that match a previous condition

Time:07-30

i have a problem that i cannot reach a solution, this is my dataset:

data = [[2022-06-01, 'John', 'A'], [2022-06-02, 'John', 'D'],
    [2022-06-03, 'John', 'D'], [2022-06-01, 'Sara', 'D'],
    [2022-06-02, 'Sara', 'D'], [2022-06-01, 'Lucas', 'A'],
    [2022-06-02, 'Lucas', 'A'], [2022-06-01, 'Mike', 'D'],
    [2022-06-02, 'Mike', 'A'], [2022-06-03, 'Mike', 'D']]

df = pd.DataFrame(data, columns=['Date', 'Name', 'Label'])

I just want to choose rows with D label that don't have an 'A' before the 'D' in each name, my real problem probably has more than 70 rows with each name and i need all the 'D' that match that condition, for example, the solution of this problem is:

Date Name Label

2022-06-01 Sara D

2022-06-02 Sara D

2022-06-01 Mike D

I need help because i didn't find any good aproach to see how i deal the problem, thank you for your help!

CodePudding user response:

Use:

mask = df.assign(indicator=df["Label"].eq("A")).groupby("Name")["indicator"].transform("cumsum").lt(1)
res = df[mask & df["Label"].eq("D")]
print(res)

Output

         Date  Name Label
3  2022-06-01  Sara     D
4  2022-06-02  Sara     D
7  2022-06-01  Mike     D

Step-by-Step

# is True when the values are A
indicator = df["Label"].eq("A")

# create new DataFrame with this new indicator column
temp = df.assign(indicator=indicator)

# group by name in the new DataFrame and find the cumulative sum of this indicator column
# notice that all D that come before an A will have 0
cumsum = temp.groupby("Name")["indicator"].transform("cumsum")

# create the final mask
mask = cumsum.lt(1) & df["Label"].eq("D")
res = df[mask]

print(res)

CodePudding user response:

Use:

#df = pd.DataFrame(data)
#df.columns=['Date', 'Name', 'Label']
df['date'] = pd.to_datetime(df['Date']) # if no datetime type
df2 = df.sort_values('Date') #if no sorted
new_df = df.loc[df2['Label'].eq('D') &
                df2['Label'].ne('A').groupby(df2['Name'], sort=False).cumprod()
         ]
print(new_df)

        Date  name Label
3 2022-06-01  Sara     D
4 2022-06-02  Sara     D
7 2022-06-01  Mike     D
  • Related