Replace NaN value of first row in a groupby with value of next row which contains specific value

I have a DataFrame which looks like the following

email              month      level
[email protected]    jan        EE2 
[email protected]    jan        nan
[email protected]    mar        MG1 
[email protected]   jan        nan
[email protected]    jan        nan
[email protected]    jun        EE3
[email protected]   jan        nan
[email protected]   apr        PT 
[email protected]   jul        MG1
[email protected]   aug        MG1
[email protected]   sep        MG2 
[email protected]    sep        MG3

I plan to do a groupby to select the first and last rows for each group.

But before I do that, I'd like to replace the first rows of each employee in the that are "nan", with the next row only if it contains either "EE" or "MG"

I was thinking of creating a new column called level_new

email              month      level     level_new
[email protected]    jan        EE2       EE2
[email protected]    jan        nan       EE3
[email protected]    mar        MG1       MG1
[email protected]   jan        nan       MG1
[email protected]    jan        nan       nan
[email protected]    jun        EE3       EE3
[email protected]   jan        nan       MG1
[email protected]   apr        PT        PT
[email protected]   jul        MG1       MG1
[email protected]   aug        MG1       MG1
[email protected]   oct        MG2       MG2  
[email protected]    sep        MG3       MG3

so that I could achieve the following groupby

email              month      level     level_new
[email protected]    jan        EE2       EE2
[email protected]    mar        MG1       MG1
[email protected]    jan        nan       EE3
[email protected]    sep        MG3       MG3
[email protected]   jan        nan       MG1
[email protected]   oct        MG2       MG2  
[email protected]    jan        nan       nan
[email protected]   jan        nan       MG1
[email protected]   jul        MG1       MG1

So far, I am only able to select the first and the last rows based on the group by, but this would still select the nan values for the first row of each employee.

#get the first and last row of each group
#".nth[-1]" retrieves the last row
#".nth[0]" retrieves the first row
df2 = df.groupby('email', as_index=False).nth([0,-1])

CodePudding user response：

We could use where to replace values other than "MG" or "EE" with NaN; then groupby bfill fillna to fill in NaN values in "level" column with the next value that is either "MG" or "EE" for each "email".

Then use groupby apply a lambda that fetches the index of the first and last value for each "email" as a list explode the list drop_duplicates (in case some emails appear only once) to create a mask that returns True for first and last values for each "email" and False otherwise. Then use this mask to filter the relevant outcome:

df['level_new'] = df['level'].fillna(df['level'].where(df['level'].str.contains('MG|EE')).groupby(df['email']).bfill())
out = df.loc[df.groupby('email')['level_new'].apply(lambda x: [x.index.min(), x.index.max()]).explode().drop_duplicates()]

Output:

               email month level level_new
4    [email protected]   jan   NaN       NaN
6   [email protected]   jan   NaN       MG1
8   [email protected]   jul   MG1       MG1
0    [email protected]   jan   EE2       EE2
2    [email protected]   mar   MG1       MG1
1    [email protected]   jan   NaN       EE3
11   [email protected]   sep   MG3       MG3
3   [email protected]   jan   NaN       MG1
10  [email protected]   sep   MG2       MG2

CodePudding user response：

Define the following fuction to process a group:

def procGrp(grp):
    if grp.index.size == 1:    # single row only
        return grp
    if pd.isnull(grp.iat[0,2]):
        nxtLev = grp.iat[1,2]  # next "level"
        if ('EE' in nxtLev) or ('MG' in nxtLev):
            grp.iat[0,2] = nxtLev  # set in 1-st row
    # Return first and last row from this group
    return grp.loc[[grp.index[0], grp.index[-1]]]

Then group your DataFrame by email and apply this function:

result = df.groupby('email').apply(procGrp)

For your data sample the result is:

               email month level
4    [email protected]   jan   NaN
6   [email protected]   jan   MG1
8   [email protected]   jul   MG1
0    [email protected]   jan   EE2
2    [email protected]   mar   MG1
1    [email protected]   jan   EE3
11   [email protected]   sep   MG3
3   [email protected]   jan   NaN
10  [email protected]   sep   MG2

As you can see:

row for [email protected] has also still NaN, as this group contains only a single row,
row for [email protected] has still NaN, as the next row has level == 'PT'.

You don't even need to create any additional column.