How to compare the current row with the previous 2 rows based on certain conditions in Python

I have a dataset similar to the one below. For each inventory item, I want to compare all of its rows and flag them based on these conditions:

If the "value" column is not null, flag that row "yes".
Else, if the "value" column is null for all rows of the inventory, evaluate the "item" column: any row whose "item" is not null should be flagged "yes". Likewise, if 1 record has a null "value" and 2 records have a non-null "value", those 2 rows need to be evaluated based on "item".
If both "value" and "item" are null for every row of the inventory, evaluate the "year" column and flag the most recent year "yes". Likewise, if "value" is null and 1 record has a null "item" while 2 records have a non-null "item", evaluate the "year" column and flag the most recent year "yes".
Every remaining row is flagged "no": df['flag'].fillna('no', inplace=True)

I can set up this if-then-else logic, but I don't know how to compare the current row with the 2 previous rows. Using groupby with transform and a custom function is great advice, and I wonder how to capture all these scenarios, or whether there is a better way.

Below is what the simplified df looks like. In my real dataset, as described above, there are also cases where, for the same inventory, the "value" column is null in every row but 2 rows have a non-null "item"; in that case, the "year" column needs to be evaluated for those 2 rows.

import pandas as pd

# Empty strings stand in for missing entries
df1 = {'inventory': ['inv1', 'inv1', 'inv1', 'inv2', 'inv2', 'inv2', 'inv3', 'inv3', 'inv3'],
       'value': ['xyz', '', '', '', '', '', '', '', ''],
       'item': ['', '304', '304', '', '205', '', '', '', ''],
       'year': [2020, 2020, 2020, 2020, 2020, 2020, 2019, 2018, 2020]}

df1 = pd.DataFrame(df1)

The desired output would be like below: a new flag column set to 'yes'/'no' based on the multiple conditions above.

CodePudding user response：

You can write your rules into a custom function and apply it to each group:

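import numpy as np

# Replace empty strings with NaN so that notna() can detect missing entries
df1 = df1.replace('', np.nan)

def make_flag_col(subdf):
    # Use the first column that has any non-null entry in this inventory
    if subdf['value'].notna().any():
        return subdf['value'].notna()
    elif subdf['item'].notna().any():
        return subdf['item'].notna()
    else:
        # Nothing in 'value' or 'item': flag the most recent year
        return subdf['year'] == subdf['year'].max()

df1['flag'] = (df1.groupby('inventory', group_keys=False)
                  .apply(make_flag_col)
                  .map({True: 'yes', False: 'no'}))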

print(df1)

  inventory value item  year flag
0      inv1   xyz  NaN  2020  yes
1      inv1   NaN  304  2020   no
2      inv1   NaN  304  2020   no
3      inv2   NaN  NaN  2020   no
4      inv2   NaN  205  2020  yes
5      inv2   NaN  NaN  2020   no
6      inv3   NaN  NaN  2019   no
7      inv3   NaN  NaN  2018   no
8      inv3   NaN  NaN  2020  yes
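
The function above stops at the first column that has any non-null entry, so it does not cover the tie cases mentioned in the question (for example, 2 rows of one inventory with a non-null "value"). Below is a minimal sketch of one possible reading of those rules as a cascade. The name flag_with_tiebreaks and the demo data are made up for illustration, and the tie-breaking order (value, then item, then latest year) is an assumption about what the asker wants:

```python
import numpy as np
import pandas as pd

def flag_with_tiebreaks(group):
    # Hypothetical helper: start with every row of the inventory as a candidate.
    candidates = pd.Series(True, index=group.index)
    # Narrow by 'value', then by 'item'; skip a column no candidate has filled in.
    for col in ['value', 'item']:
        narrowed = candidates & group[col].notna()
        if narrowed.any():
            candidates = narrowed
        if candidates.sum() == 1:
            return candidates
    # Still tied: keep the candidate(s) with the most recent year.
    latest = group.loc[candidates, 'year'].max()
    return candidates & (group['year'] == latest)

# Made-up data: 'a' has 2 rows with a value, 'b' has 2 rows with an item.
demo = pd.DataFrame({
    'inventory': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': ['x', 'y', np.nan, np.nan, np.nan, np.nan],
    'item': [np.nan, '7', np.nan, '1', '2', np.nan],
    'year': [2020, 2019, 2021, 2018, 2020, 2021],
})

# Evaluate each inventory separately, then stitch the masks back together.
flags = pd.concat([flag_with_tiebreaks(g) for _, g in demo.groupby('inventory')])
demo['flag'] = flags.reindex(demo.index).map({True: 'yes', False: 'no'})
print(demo)
```

In this sketch, inventory 'a' is decided by "item" among the 2 rows that have a "value", and inventory 'b' is decided by "year" among the 2 rows that have an "item". If two candidates share the latest year, both are flagged.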

CodePudding user response：

This is actually not very complex. You mostly need a few groupby transform combinations. Here's a vectorized (read: very fast) solution:

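import numpy as np

# df is the DataFrame from the question (called df1 there)
df = df.replace('', np.nan)

cond = (
    # Rule 1: the row itself has a value
    df['value'].notna()
    # Rule 2: no row of the inventory has a value, but this row has an item
    | (df['value'].isna().groupby(df['inventory']).transform('all')
       & df['item'].notna())
    # Rule 3: the inventory has neither value nor item; take the latest year
    | (df[['value', 'item']].isna().groupby(df['inventory']).transform('all').all(axis=1)
       & (df['year'] == df.groupby('inventory')['year'].transform('max')))
)

df['flag'] = cond.map({True: 'yes', False: 'no'})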

Output:

>>> df
  inventory value item  year flag
0      inv1   xyz  NaN  2020  yes
1      inv1   NaN  304  2020   no
2      inv1   NaN  304  2020   no
3      inv2   NaN  NaN  2020   no
4      inv2   NaN  205  2020  yes
5      inv2   NaN  NaN  2020   no
6      inv3   NaN  NaN  2019   no
7      inv3   NaN  NaN  2018   no
8      inv3   NaN  NaN  2020  yes
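
To check which rule produced each flag, the same condition can be split into named masks. This is only a sketch: the rule column and the mask names are hypothetical additions, not part of the answer above, and the frame is rebuilt from the question's sample data so that the snippet runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'inventory': ['inv1'] * 3 + ['inv2'] * 3 + ['inv3'] * 3,
    'value': ['xyz', '', '', '', '', '', '', '', ''],
    'item': ['', '304', '304', '', '205', '', '', '', ''],
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2019, 2018, 2020],
}).replace('', np.nan)

has_value = df['value'].notna()
has_item = df['item'].notna()

# Per-inventory facts, broadcast back to every row of the group.
no_value_in_group = (~has_value).groupby(df['inventory']).transform('all')
no_item_in_group = (~has_item).groupby(df['inventory']).transform('all')
is_latest_year = df['year'] == df.groupby('inventory')['year'].transform('max')

# One mask per rule, in the same order as the combined condition.
rule_value = has_value
rule_item = no_value_in_group & has_item
rule_year = no_value_in_group & no_item_in_group & is_latest_year

# Record the rule that fired (illustrative column) alongside the flag.
df['rule'] = np.select([rule_value, rule_item, rule_year],
                       ['value', 'item', 'year'], default='none')
df['flag'] = np.where(rule_value | rule_item | rule_year, 'yes', 'no')
print(df)
```

Keeping the masks separate makes it easier to add or adjust one rule without touching the others, at the cost of a few extra intermediate Series.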