compare value in two rows in a column pandas-CodePudding

I have a pandas df something like this:

           color          pct               days               text
  1         red            5                 7                 good
  2         red           10                30                 good
  3         red           11                60                  bad
  4         blue           6                 7                  bad
  5         blue          15                30                 good
  6         blue          21                60                  bad
  7        yellow          2                 7                 good
  8        yellow          5                30                  bad
  9        yellow          7                60                  bad

So basically, for each color, I have percentage values for 7 days, 30 days and 60 days. Please note that these are not always in correct order as I gave in example above. My task now is to look at the change in percentage for each color between the consecutive days values and if the change is greater or equal to 5%, then write in column "text" as "NA". Text in days 7 category is default and cannot be overwritten.

Desired result:

           color          pct               days               text
  1         red            5                 7                 good
  2         red           10                30                  NA
  3         red           11                60                  bad
  4         blue           6                 7                  bad
  5         blue          15                30                  NA
  6         blue          21                60                  NA
  7        yellow          2                 7                 good
  8        yellow          5                30                  bad
  9        yellow          7                60                  bad

I am able to achieve this by a very very long process that I am very sure is not efficient. I am sure there is a much better way of doing this, but I am new to python, so struggling. Can someone please help me with this? Many thanks in advance

CodePudding user response：

In the code below I'm reading your example table into a pandas dataframe using io, you don't need to do this, you already have your pandas table.

import pandas as pd
import io

df = pd.read_csv(io.StringIO(
"""           color          pct               days               text
  1         red            5                 7                 good
  2         red           10                30                 good
  3         red           11                60                  bad
  4         blue           6                 7                  bad
  5         blue          15                30                 good
  6         blue          21                60                  bad
  7        yellow          2                 7                 good
  8        yellow          5                30                  bad
  9        yellow          7                60                  bad"""
),delim_whitespace=True)

not_seven_rows = df['days'].ne(7)
good_rows = df['pct'].lt(5)

#Set the rows which are < 5 and not 7 days to be 'good'
df.loc[good_rows & not_seven_rows, 'text'] = 'good'

#Set the rows which are >= 5 and not 7 days to be 'NA'
df.loc[(~good_rows) & not_seven_rows, 'text'] = 'NA'

df

Output

CodePudding user response：

A variation on a (now-deleted) suggested answer as comment:

# ensure numeric data
df['pct'] = pd.to_numeric(df['pct'], errors='coerce')
df['days'] = pd.to_numeric(df['days'], errors='coerce')

# update in place
df.loc[df.sort_values(['color','days'])
         .groupby('color')['pct']
         .diff().ge(5), 'text'] = 'NA'

Output:

    color  pct  days  text
1     red    5     7  good
2     red   10    30    NA
3     red   11    60   bad
4    blue    6     7   bad
5    blue   15    30    NA
6    blue   21    60    NA
7  yellow    2     7  good
8  yellow    5    30   bad
9  yellow    7    60   bad