Home > Mobile >  Pandas: replacing nan values conditionally within a group
Pandas: replacing nan values conditionally within a group

Time:11-28

I have a dataframe with missing values. for each index in a column group, i want to replace these values seperately. If all of the values in a group are missing, i want to replace the values with 1. If only some of the values are missing, i want to replace it with data from an imputed dataframe

dataframe 1

index d0_1 d0_2 d1_1 d1_2
group d0 d0 d1 d1
1 3 3 NaN NaN
2 3 NaN 3 3

dataframe 2 (the imputed one)

index d0_1 d0_2 d1_1 d1_2
group d0 d0 d1 d1
1 3 3 2 2
2 3 2 3 3

output:

index d0_1 d0_2 d1_1 d1_2
group d0 d0 d1 d1
1 3 3 1 1
2 3 2 3 3

my data is much larger and the groups are larger as well.

Ive been struggling now with this for days, i just cant seem to find a working solution

my current solution is iterating over all the groups, and using groupby.transform to replace values, but i dont know how to tell the lambda function to take the values from my second data frame, and my current lambda function also doesnt replace all the values with 1 either, instead just returning the old groups with no changes

df1 = pd.read_csv("file.txt", sep = "\t", index_col = "T: Protein.Group")

def group(a: pd.DataFrame):
    a_grouped = a.groupby(["group"] , axis=1)
    return a_grouped

def getgroup(a: pd.DataFrame):
    new_idx = pd.MultiIndex.from_arrays([
        a.columns,
        a.columns.str.extract("(d\d )_\d ", expand = False)
    ], names=["index", "group"])
    a.columns = new_idx
    return a

df1grp = group(getgroup(df1))

for i in list(df1grg.groups.keys())
        df1grp.get_group(i).transform(
            lambda x: 1 if x.eq(np.nan).all() else x
        )

CodePudding user response:

IIUC:

df = df1.mask(df1.groupby('group', axis=1).count() == 0, 1)
df = df.where(~df.isna(), df2)

>>> df
index d0_1 d0_2 d1_1 d1_2
group   d0   d0   d1   d1
1        3    3    1    1
2        3    2    3    3

This is assuming the columns are indeed a MultiIndex as you describe, e.g.:

>>> df1.columns
MultiIndex([('d0_1', 'd0'),
            ('d0_2', 'd0'),
            ('d1_1', 'd1'),
            ('d1_2', 'd1')],
           names=['index', 'group'])

(I initially spent a few minutes making such columns as I thought you had simple Index columns and a first row with the 'group'...)

  • Related