I have missing values in one column that I would like to fill by random sampling from a source distribution:
import pandas as pd
import numpy as np
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
source
age location x
0 21 0 1
1 21 0 2
2 21 1 3
3 21 1 4
4 21 1 4
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
target
age location x
0 21 0 NaN
1 21 0 NaN
2 21 0 NaN
3 21 1 NaN
4 21 2 NaN
Now I need to fill in the missing values of x in the target dataframe by choosing a random value of x from the source dataframe that have the same values for age and location as the missing x with replacement. If there is no value of x in source that has the same values for age and location as the missing value it should be left as missing.
Expected output:
age location x
0 21 0 1 with probability 0.5 2 otherwise
1 21 0 1 with probability 0.5 2 otherwise
2 21 0 1 with probability 0.5 2 otherwise
3 21 1 3 with probability 0.33 4 otherwise
4 21 2 NaN
I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.
Is there a better way?
CodePudding user response:
You can create a common grouper and perform a merge
:
cols = ['age', 'location']
(target[cols]
.assign(group=target.groupby(cols).cumcount()) # compute subgroup for duplicates
.merge((# below: assigns a random row group
source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
.groupby(cols ['group'], as_index=False) # get one row per group
.first()
),
on=cols ['group'], how='left') # merge
#drop('group', axis=1) # column kept for clarity, uncomment to remove
)
output:
age location group x
0 20 0 0 0.339955
1 20 0 1 0.700506
2 21 0 0 0.777635
3 22 1 0 NaN
CodePudding user response:
You can create unique MultiIndex
by GroupBy.cumcount
if necessary in both DataFrame
s and then use DataFrame.update
with overwrite=False
for replace only missing values:
cols = ['age', 'location']
source1 = source.sample(frac=1)
source1 = source1.set_index(cols [source1.groupby(cols).cumcount()])
target1 = target.set_index(cols [target.groupby(cols).cumcount()])
target1.update(source1, overwrite=False)
target1 = target1.reset_index(level=2, drop=True).reset_index()
print (target1)
age location x
0 21 0 1.0
1 21 0 2.0
2 21 1 2.0
3 21 2 NaN