How to fill missing data in large dataframe from other rows for the same id/person?

Time:11-05

I have this dataframe:

import numpy as np
import pandas as pd

data = [{'a':2,'b': 2, 'c':3},{'b': 2, 'c':np.nan}, {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': np.nan, 'c': np.nan}]
  
df = pd.DataFrame(data, index =['John', 'John', 'Mike' ,'Mike'])
  

What I am trying to do is fill in each user's missing values using that user's other rows.

My goal dataframe would be:

data = [{'a':2,'b': 2, 'c':3},{'a':2, 'b': 2, 'c':3}, {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': 20, 'c': 30}]
  
df = pd.DataFrame(data, index =['John', 'John', 'Mike' ,'Mike'])

This needs to work for thousands of rows, but I believe a solution for this minimal example should carry over to a big dataframe.

I do not want to use pd.merge: my original dataframes already have thousands of columns, and merging would add thousands more.

CodePudding user response:

You can use groupby().transform('first') to extract the first valid values for each user, then fillna:

df = df.fillna(df.groupby(level=0).transform('first'))

Note: You can also

  1. replace 'first' with another aggregation, e.g. 'mean', if you prefer.
  2. use groupby().first() directly instead of transform, since you are grouping on the index and fillna aligns on index labels.
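For reference, here is a minimal end-to-end sketch of the above using the sample data from the question. The variable names (first_valid, filled, per_person, filled_direct) are only illustrative, and the second variant assumes fillna aligns the one-row-per-person frame back onto the repeated index labels:

```python
import numpy as np
import pandas as pd

# Sample data from the question: the second John row has no 'a' key at all.
data = [
    {'a': 2, 'b': 2, 'c': 3},
    {'b': 2, 'c': np.nan},
    {'a': 10, 'b': 20, 'c': 30},
    {'a': 10, 'b': np.nan, 'c': np.nan},
]
df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])

# Variant 1: transform('first') returns a frame shaped like df in which
# every row holds the first non-null value of its group for each column.
first_valid = df.groupby(level=0).transform('first')
filled = df.fillna(first_valid)

# Variant 2: first() returns one row per person; fillna then matches
# rows by index label, so each John/Mike row picks up its own group's values.
per_person = df.groupby(level=0).first()
filled_direct = df.fillna(per_person)

print(filled)
print(filled_direct)
```

Only the NaN cells are overwritten; values that are already present stay as they are. Because a column containing NaN is stored as float, the filled columns come back as floats as well.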