Home > Software engineering >  How to fill missing values in a column by random sampling another column by other column values
How to fill missing values in a column by random sampling another column by other column values

Time:09-23

I have missing values in one column that I would like to fill by random sampling from a source distribution:

import pandas as pd
import numpy as np
source = pd.DataFrame({'age':5*[21],
                       'location':[0,0,1,1,1],
                       'x':[1,2,3,4,4]})
source

    age location    x
0   21  0   1
1   21  0   2
2   21  1   3
3   21  1   4
4   21  1   4

target = pd.DataFrame({'age':5*[21],
                       'location':[0,0,0,1,2],
                       'x':5*[np.nan]})
target
    age location    x
0   21  0   NaN
1   21  0   NaN
2   21  0   NaN
3   21  1   NaN
4   21  2   NaN

Now I need to fill in the missing values of x in the target dataframe by choosing a random value of x from the source dataframe that have the same values for age and location as the missing x with replacement. If there is no value of x in source that has the same values for age and location as the missing value it should be left as missing.

Expected output:

    age location    x
0   21  0   1 with probability 0.5 2 otherwise
1   21  0   1 with probability 0.5 2 otherwise
2   21  0   1 with probability 0.5 2 otherwise
3   21  1   3 with probability 0.33 4 otherwise
4   21  2   NaN

I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.

Is there a better way?

CodePudding user response:

You can create a common grouper and perform a merge:

cols = ['age', 'location']


(target[cols]
       .assign(group=target.groupby(cols).cumcount())               # compute subgroup for duplicates
       .merge((# below: assigns a random row group
               source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
                     .groupby(cols ['group'], as_index=False)       # get one row per group
                     .first()
              ),
              on=cols ['group'], how='left')    # merge
       #drop('group', axis=1) # column kept for clarity, uncomment to remove
)

output:

   age  location  group         x
0   20         0      0  0.339955
1   20         0      1  0.700506
2   21         0      0  0.777635
3   22         1      0       NaN

CodePudding user response:

You can create unique MultiIndex by GroupBy.cumcount if necessary in both DataFrames and then use DataFrame.update with overwrite=False for replace only missing values:

cols = ['age', 'location']
source1 = source.sample(frac=1)

source1 = source1.set_index(cols   [source1.groupby(cols).cumcount()])
target1 = target.set_index(cols   [target.groupby(cols).cumcount()])

target1.update(source1, overwrite=False)
target1 = target1.reset_index(level=2, drop=True).reset_index()
print (target1)
   age  location    x
0   21         0  1.0
1   21         0  2.0
2   21         1  2.0
3   21         2  NaN
  • Related