I have a DataFrame which looks like the following
email month level
[email protected] jan EE2
[email protected] jan nan
[email protected] mar MG1
[email protected] jan nan
[email protected] jan nan
[email protected] jun EE3
[email protected] jan nan
[email protected] apr PT
[email protected] jul MG1
[email protected] aug MG1
[email protected] sep MG2
[email protected] sep MG3
I plan to do a groupby
to select the first and last rows for each group.
But before I do that, I'd like to replace the first rows of each employee in the that are "nan", with the next row only if it contains either "EE" or "MG"
I was thinking of creating a new column called level_new
email month level level_new
[email protected] jan EE2 EE2
[email protected] jan nan EE3
[email protected] mar MG1 MG1
[email protected] jan nan MG1
[email protected] jan nan nan
[email protected] jun EE3 EE3
[email protected] jan nan MG1
[email protected] apr PT PT
[email protected] jul MG1 MG1
[email protected] aug MG1 MG1
[email protected] oct MG2 MG2
[email protected] sep MG3 MG3
so that I could achieve the following groupby
email month level level_new
[email protected] jan EE2 EE2
[email protected] mar MG1 MG1
[email protected] jan nan EE3
[email protected] sep MG3 MG3
[email protected] jan nan MG1
[email protected] oct MG2 MG2
[email protected] jan nan nan
[email protected] jan nan MG1
[email protected] jul MG1 MG1
So far, I am only able to select the first and the last rows based on the group by, but this would still select the nan values for the first row of each employee.
#get the first and last row of each group
#".nth[-1]" retrieves the last row
#".nth[0]" retrieves the first row
df2 = df.groupby('email', as_index=False).nth([0,-1])
CodePudding user response:
We could use where
to replace values other than "MG" or "EE" with NaN; then groupby
bfill
fillna
to fill in NaN values in "level" column with the next value that is either "MG" or "EE" for each "email".
Then use groupby
apply a lambda that fetches the index of the first and last value for each "email" as a list explode
the list drop_duplicates
(in case some emails appear only once) to create a mask that returns True for first and last values for each "email" and False otherwise. Then use this mask to filter the relevant outcome:
df['level_new'] = df['level'].fillna(df['level'].where(df['level'].str.contains('MG|EE')).groupby(df['email']).bfill())
out = df.loc[df.groupby('email')['level_new'].apply(lambda x: [x.index.min(), x.index.max()]).explode().drop_duplicates()]
Output:
email month level level_new
4 [email protected] jan NaN NaN
6 [email protected] jan NaN MG1
8 [email protected] jul MG1 MG1
0 [email protected] jan EE2 EE2
2 [email protected] mar MG1 MG1
1 [email protected] jan NaN EE3
11 [email protected] sep MG3 MG3
3 [email protected] jan NaN MG1
10 [email protected] sep MG2 MG2
CodePudding user response:
Define the following fuction to process a group:
def procGrp(grp):
if grp.index.size == 1: # single row only
return grp
if pd.isnull(grp.iat[0,2]):
nxtLev = grp.iat[1,2] # next "level"
if ('EE' in nxtLev) or ('MG' in nxtLev):
grp.iat[0,2] = nxtLev # set in 1-st row
# Return first and last row from this group
return grp.loc[[grp.index[0], grp.index[-1]]]
Then group your DataFrame by email and apply this function:
result = df.groupby('email').apply(procGrp)
For your data sample the result is:
email month level
4 [email protected] jan NaN
6 [email protected] jan MG1
8 [email protected] jul MG1
0 [email protected] jan EE2
2 [email protected] mar MG1
1 [email protected] jan EE3
11 [email protected] sep MG3
3 [email protected] jan NaN
10 [email protected] sep MG2
As you can see:
- row for [email protected] has also still NaN, as this group contains only a single row,
- row for [email protected] has still NaN, as the next row has level == 'PT'.
You don't even need to create any additional column.