Drop a row when a column value is duplicated if another column meets a criterion

Time:03-23

I have a DataFrame from which I want to drop rows whenever two or more consecutive rows share the same value in column A, choosing the row to keep based on column B. Within each such run, I want to keep the row with the highest value in B if A is 'high', or the lowest if A is 'low'. In essence, I want every high to be followed by a low and every low by a high.

import pandas as pd

df = pd.DataFrame({'A': ['low', 'high', 'high', 'low', 'low', 'low'],
                   'B': [10, 70, 90, 40, 50, 60]})

Output:

     A     B
0   low   10
1   high  70
2   high  90
3   low   40
4   low   50
5   low   60

Desired:

     A     B
0   low   10
2   high  90
3   low   40

I'm trying to work out how to implement this logic and have hit a brick wall.

CodePudding user response:

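Here is another way: negate B on the 'low' rows so that a single max() per run of consecutive equal A values picks the highest 'high' and the lowest 'low', then take the absolute value to restore the sign. This assumes B is non-negative, and the result is renumbered from 0 rather than keeping the original index: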

d = {'low':-1}

(df.assign(B = df['B'].mul(df['A'].map(d),fill_value=1))
 .groupby(['A',pd.Series(pd.factorize(df['A'])[0]).diff().ne(0).cumsum()]).max()
 .abs()
 .sort_index(level=1)
 .droplevel(1)
 .reset_index())

Output:

      A     B
0   low  10.0
1  high  90.0
2   low  40.0
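
The chain above packs several steps into one expression. As a rough sketch of the same idea spelled out step by step (the names `sign`, `run` and `signed` are introduced here for illustration only, and B is again assumed to be non-negative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['low', 'high', 'high', 'low', 'low', 'low'],
                   'B': [10, 70, 90, 40, 50, 60]})

# -1 for 'low' rows, +1 for everything else, so that max() of the signed
# values selects the smallest B inside a 'low' run.
sign = df['A'].map({'low': -1}).fillna(1).astype(int)

# Label each run of consecutive equal values in A: 1, 2, 2, 3, 3, 3.
run = df['A'].ne(df['A'].shift()).cumsum().rename('run')

# One row per run: the run's label and the max of the signed values.
signed = pd.DataFrame({'A': df['A'], 'signed': df['B'] * sign})
out = signed.groupby(run).agg(A=('A', 'first'), B=('signed', 'max'))

# Undo the sign flip (only valid because B is assumed non-negative).
out['B'] = out['B'].abs()
out = out.reset_index(drop=True)
print(out)
```

Multiplying by an integer sign instead of using mul with fill_value keeps B as an integer column, so the result shows 10, 90, 40 rather than 10.0, 90.0, 40.0.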

CodePudding user response:

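Here's a quick and dirty way using groupby.apply. It relies on the grouping column A being visible inside the applied function, which recent pandas versions (2.2 and later) flag with a deprecation warning: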

out = (df.groupby(['A', df['A'].ne(df['A'].shift()).cumsum()])
       .apply(lambda x: x.max() if x['A'].iat[0]=='high' else x.min())
       .droplevel(0).sort_index().reset_index(drop=True))

Another way is to first compute the max of each group, then mask the "low" groups and replace their values with the group min:

g = df.groupby(['A', df['A'].ne(df['A'].shift()).cumsum()], sort=False)['B']
out = (g.max()
       .mask(lambda x: x.index.get_level_values(0) == 'low', g.min())
       .droplevel(1)
       .reset_index())

Output:

      A   B
0   low  10
1  high  90
2   low  40
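
All of the outputs above are renumbered from 0, whereas the desired result in the question keeps the original index labels 0, 2, 3. A minimal sketch of one way to select the original rows instead, assuming A only ever contains 'high' and 'low' (any value other than 'high' is treated as 'low' here); the names `run`, `signed`, `keep` and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['low', 'high', 'high', 'low', 'low', 'low'],
                   'B': [10, 70, 90, 40, 50, 60]})

# Label each run of consecutive equal values in A.
run = df['A'].ne(df['A'].shift()).cumsum()

# Keep B as-is on 'high' rows and negate it elsewhere, so that the largest
# signed value in a 'low' run corresponds to the smallest B.
signed = df['B'].where(df['A'].eq('high'), -df['B'])

# idxmax() returns the original index label of the winning row in each run
# (the first one if there are ties).
keep = signed.groupby(run).idxmax()

# Select those rows from the untouched frame; the index is preserved.
result = df.loc[keep.to_numpy()]
print(result)
```

Because the sign flip is only used to pick row labels and the rows are then taken from the original frame, this variant also works when B contains negative values.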