Classify DataFrame rows based on the 3 most recent columns

Time:04-20

I have a pandas DataFrame in which each column represents a quarter, with the most recent quarters on the right. Not all of the data arrives at the same time, so some columns may have missing values (NaN).

I would like to add at the end of the DataFrame another column called Criteria:

  • If each of the 3 most recent quarters with available data has a value greater than 10, then Criteria = Y (that is, NaN values are ignored when applying the condition)
  • Otherwise N

In the sample below, B, D and E would be equal to Y, the rest would be marked as N

But I can't find a way to apply the criteria dynamically so that the NaN values are ignored.

import pandas as pd
import numpy as np

# np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0
list_of_tuples = [
    (11, 34, 78, 5, -11, -56),
    (12, np.nan, 98, 7, np.nan, 18),
    (21, 56, -78, -23, 64, 28),
    (56, -98, 35, 63, 27, np.nan),
    (13, 34, -11, 11, 56, 10),
    (12, 41, 12, 41, -78, -18),
]

df = pd.DataFrame(
    list_of_tuples,
    index=['A', 'B', 'C', 'D', 'E', 'F'],
    columns=['2020Q4', '2021Q1', '2021Q2', '2021Q3', '2021Q4', '2022Q1'],
)

print(df)

Any help would be appreciated, thanks

CodePudding user response:

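If I understand correctly, you can stack the DataFrame into one long Series without the NaN values and use a custom groupby function that checks the last three remaining values of each row: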

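# The explicit dropna() keeps the NaN-skipping behavior even if stack() itself
# does not drop NaN (its default has changed across pandas versions)
df['Criteria'] = np.where(df.stack().dropna().groupby(level=0)
                            .apply(lambda s: s.tail(3).ge(10).all()),
                          'Y', 'N')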

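NB. B is "N" because its 2021Q3 value (7) is < 10. The comparison uses ge(10), i.e. >= 10, so E, whose 2022Q1 value is exactly 10, comes out as "Y" as the question expects; use gt(10) for a strictly-greater test.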

output:

   2020Q4  2021Q1  2021Q2  2021Q3  2021Q4  2022Q1 Criteria
A      11    34.0      78       5   -11.0   -56.0        N
B      12     NaN      98       7     NaN    18.0        N
C      21    56.0     -78     -23    64.0    28.0        N
D      56   -98.0      35      63    27.0     NaN        Y
E      13    34.0     -11      11    56.0    10.0        Y
F      12    41.0      12      41   -78.0   -18.0        N
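
For comparison, here is a minimal row-wise sketch of the same rule that avoids stack entirely. The helper name classify_recent and its parameters n, threshold and inclusive are hypothetical, not part of pandas. It also assumes that a row with fewer than n available values should be "N", which the question does not specify.

```python
import numpy as np
import pandas as pd


def classify_recent(frame, n=3, threshold=10, inclusive=True):
    """Return 'Y'/'N' per row based on its last `n` non-NaN values.

    Hypothetical helper; `inclusive=True` means >= threshold,
    `inclusive=False` means strictly > threshold.
    """
    def check(row):
        # Drop missing quarters, then keep the n right-most (most recent) values
        recent = row.dropna().tail(n)
        # Assumption: rows with fewer than n available values fail the test
        if len(recent) < n:
            return False
        ok = recent >= threshold if inclusive else recent > threshold
        return bool(ok.all())

    passed = frame.apply(check, axis=1)
    return pd.Series(np.where(passed, "Y", "N"), index=frame.index, name="Criteria")


df = pd.DataFrame(
    [
        (11, 34, 78, 5, -11, -56),
        (12, np.nan, 98, 7, np.nan, 18),
        (21, 56, -78, -23, 64, 28),
        (56, -98, 35, 63, 27, np.nan),
        (13, 34, -11, 11, 56, 10),
        (12, 41, 12, 41, -78, -18),
    ],
    index=["A", "B", "C", "D", "E", "F"],
    columns=["2020Q4", "2021Q1", "2021Q2", "2021Q3", "2021Q4", "2022Q1"],
)

# Compute both variants before attaching a column, so the helper only
# ever sees the numeric quarter columns
criteria = classify_recent(df)                   # >= 10, like ge(10) above
strict = classify_recent(df, inclusive=False)    # > 10, the literal wording

df["Criteria"] = criteria
print(df)
```

With inclusive=False, row E would no longer qualify, because its most recent value is exactly 10. A row-wise apply is usually slower than the stack/groupby version on large frames, but it makes each step of the rule easy to read.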