I have a dataframe with missing values. for each index in a column group, i want to replace these values seperately. If all of the values in a group are missing, i want to replace the values with 1. If only some of the values are missing, i want to replace it with data from an imputed dataframe
dataframe 1
index | d0_1 | d0_2 | d1_1 | d1_2 |
---|---|---|---|---|
group | d0 | d0 | d1 | d1 |
1 | 3 | 3 | NaN | NaN |
2 | 3 | NaN | 3 | 3 |
dataframe 2 (the imputed one)
index | d0_1 | d0_2 | d1_1 | d1_2 |
---|---|---|---|---|
group | d0 | d0 | d1 | d1 |
1 | 3 | 3 | 2 | 2 |
2 | 3 | 2 | 3 | 3 |
output:
index | d0_1 | d0_2 | d1_1 | d1_2 |
---|---|---|---|---|
group | d0 | d0 | d1 | d1 |
1 | 3 | 3 | 1 | 1 |
2 | 3 | 2 | 3 | 3 |
my data is much larger and the groups are larger as well.
Ive been struggling now with this for days, i just cant seem to find a working solution
my current solution is iterating over all the groups, and using groupby.transform to replace values, but i dont know how to tell the lambda function to take the values from my second data frame, and my current lambda function also doesnt replace all the values with 1 either, instead just returning the old groups with no changes
df1 = pd.read_csv("file.txt", sep = "\t", index_col = "T: Protein.Group")
def group(a: pd.DataFrame):
a_grouped = a.groupby(["group"] , axis=1)
return a_grouped
def getgroup(a: pd.DataFrame):
new_idx = pd.MultiIndex.from_arrays([
a.columns,
a.columns.str.extract("(d\d )_\d ", expand = False)
], names=["index", "group"])
a.columns = new_idx
return a
df1grp = group(getgroup(df1))
for i in list(df1grg.groups.keys())
df1grp.get_group(i).transform(
lambda x: 1 if x.eq(np.nan).all() else x
)
CodePudding user response:
IIUC:
df = df1.mask(df1.groupby('group', axis=1).count() == 0, 1)
df = df.where(~df.isna(), df2)
>>> df
index d0_1 d0_2 d1_1 d1_2
group d0 d0 d1 d1
1 3 3 1 1
2 3 2 3 3
This is assuming the columns are indeed a MultiIndex
as you describe, e.g.:
>>> df1.columns
MultiIndex([('d0_1', 'd0'),
('d0_2', 'd0'),
('d1_1', 'd1'),
('d1_2', 'd1')],
names=['index', 'group'])
(I initially spent a few minutes making such columns as I thought you had simple Index columns and a first row with the 'group'
...)