Fill a DataFrame with medians of group only for cell containing specific value

Time:07-18

I am trying to find a clean way to fill my DataFrame with the median values of its groups.
I have 2 groups, "I" and "J", and 2 factors, "A" and "B". I want to replace each negative value with the median of the group to which the value belongs.
One constraint is that I don't want to replace the NaN values.
Here is the code that builds my initial DataFrame:

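import numpy as np
import pandas as pd

tuples = [('I','0'), ('I','1'), ('I', '2'), ('J', '3'), ('I', '4'), ('J', '5')]
index = pd.MultiIndex.from_tuples(tuples, names=["Id1", "Id2"])
df = pd.DataFrame(np.arange(12).reshape(-1, 2), columns=['A', 'B'], index=index)
df = df.astype({"B": float})  # column B has to hold NaN
# Positional assignment avoids chained indexing such as df["A"].iloc[0] = -1
df.iloc[0, 0] = -1        # A, first row
df.iloc[-1, 1] = -1       # B, last row
df.iloc[-2, 1] = 18       # B, second-to-last row
df.iloc[:2, 1] = np.nan   # B, first two rows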

which gives:

          A     B
Id1 Id2          
I   0    -1   NaN
    1     2   NaN
    2     4   5.0
J   3     6   7.0
I   4     8  18.0
J   5    10  -1.0

Here is the way I solved it:

df = df.astype(float)  # a float frame can hold NaN and non-integer medians
ind, col = np.where(df < 0)
for xx, yy in zip(ind, col):
    # Blank the negative cell so it is excluded from the median
    df.iloc[xx, yy] = np.nan
    index_Id1 = df.index.get_level_values("Id1")[xx]
    df.iloc[xx, yy] = df.loc[index_Id1, :].iloc[:, yy].median()
df

This gives what I was looking for:

            A     B
Id1 Id2            
I   0     4.0   NaN
    1     2.0   NaN
    2     4.0   5.0
J   3     6.0   7.0
I   4     8.0  18.0
J   5    10.0   7.0

It works, but it doesn't look nice, and it is surely not very efficient since it relies on a for loop. I would be very pleased to see a solution that does the job with pandas or NumPy functions.
Thanks in advance.

CodePudding user response:

You can do something like this:

df.mask(df<0, df.mask(df<0, np.nan).groupby(level=0).median())

Let's break that down. You need the median of each of the two groups "I" and "J", excluding the negative values:

median_df = df.mask(df<0, np.nan).groupby(level=0).median()

Then you want to substitute the median for the negative values in the original DataFrame:

df.mask(df<0, median_df)
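
As a self-contained check of the idea above, here is a minimal sketch that rebuilds the sample frame and fills the negative cells. It is a variation rather than the exact code of this answer: it uses `groupby(...).transform("median")` so that the medians already have the same shape and index as `df`, which avoids relying on index alignment between the grouped result and the MultiIndex. The names `negative`, `group_median` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (float dtype so NaN fits).
tuples = [("I", "0"), ("I", "1"), ("I", "2"), ("J", "3"), ("I", "4"), ("J", "5")]
index = pd.MultiIndex.from_tuples(tuples, names=["Id1", "Id2"])
df = pd.DataFrame(
    {"A": [-1, 2, 4, 6, 8, 10], "B": [np.nan, np.nan, 5, 7, 18, -1]},
    index=index,
    dtype=float,
)

# NaN < 0 evaluates to False, so NaN cells are never selected for replacement.
negative = df < 0
# Hide the negatives, then broadcast each group's median back to every row.
group_median = df.mask(negative).groupby(level="Id1").transform("median")
# Replace only the negative cells with the median of their group.
filled = df.mask(negative, group_median)
print(filled)
```

Because the comparison `df < 0` is False for NaN, the NaN cells pass through unchanged, which is the constraint stated in the question.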

CodePudding user response:

You can use this:

It groups each column by Id1 and replaces -1 with the median of the group, computed without the -1 values. Note that it matches the value -1 exactly rather than any negative number.

for col in df.columns:
    # transform keeps the original index, so the result assigns back cleanly
    df[col] = df.groupby('Id1')[col].transform(lambda x: (
        x.replace(-1, x.loc[x != -1].median())
    ))
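
One edge case worth knowing about when taking the median "not including the -1 values": if every value of a group is -1 (or NaN), there is nothing left to take a median of, and the replacement is NaN. The sketch below illustrates this on hypothetical toy data (the frame and the column names `value` and `filled` are made up for the example, not taken from the question).

```python
import pandas as pd

# Hypothetical data: every value of group "J" is the -1 placeholder.
toy = pd.DataFrame(
    {"Id1": ["I", "I", "I", "J", "J"], "value": [-1.0, 3.0, 5.0, -1.0, -1.0]}
)

is_placeholder = toy["value"] == -1
# Turn placeholders into NaN so they are ignored by the median.
valid = toy["value"].mask(is_placeholder)
# Median per group, repeated on every row of that group.
medians = valid.groupby(toy["Id1"]).transform("median")
toy["filled"] = toy["value"].mask(is_placeholder, medians)
print(toy)
```

Group "I" gets the median of 3 and 5, while both rows of group "J" end up as NaN because that group has no valid value.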

CodePudding user response:

Let's start from a small correction in the way you created the source DataFrame: As each column can contain NaN, which is a special case of float, create the temporary DataFrame with data type of float:

np.arange(12, dtype='float')

(no change in the rest of your code to create the DataFrame).
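
To illustrate why the float dtype matters, here is a small sketch (the variable names are only illustrative): a plain NumPy integer dtype has no representation for NaN, so pandas converts an integer Series to float64 as soon as a missing value appears.

```python
import numpy as np
import pandas as pd

ints = pd.Series(np.arange(3))                    # integer dtype
floats = pd.Series(np.arange(3, dtype="float"))   # float64 from the start
print(ints.dtype, floats.dtype)

# Reindexing onto a label that does not exist introduces a missing value,
# which forces the integer Series to become float64.
with_gap = ints.reindex([0, 1, 2, 3])
print(with_gap.dtype)
```

Starting from floats avoids this silent dtype change halfway through the processing.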

You will need the following group processing function:

def grpProc(grp):
    # Work on a copy so the caller's data is not modified in place
    grp = grp.copy()
    grp[grp == -1] = grp[grp != -1].median()
    return grp

It computes the median of the elements != -1 and writes it to the elements == -1, assuming that the source group (grp) is the part of the current column that belongs to one Id1 value. The changed group is then returned.

And to get the result, apply it to each column of your DataFrame, grouped by Id1 (level 0):

result = df.apply(lambda col: col.groupby(level=0).transform(grpProc))

No axis parameter has been passed, so this function is applied to each column (axis == 0).
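
The effect of the axis parameter can be seen with a tiny sketch on a made-up two-column frame (not the data from the question): with the default axis=0 the function receives each column as a Series, and with axis=1 it receives each row.

```python
import pandas as pd

small = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Default axis=0: the function is called once per column.
per_column = small.apply(lambda col: col.name)
# axis=1: the function is called once per row instead.
per_row = small.apply(lambda row: row.name, axis=1)
print(per_column.tolist())
print(per_row.tolist())
```

Here the Series name is the column label in the first call and the row label in the second.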

For your sample data the result is:

            A     B
Id1 Id2            
I   0     4.0   NaN
    1     2.0   NaN
    2     4.0   5.0
J   3     6.0   7.0
I   4     8.0  18.0
J   5    10.0   7.0