How to fill missing data in a large dataframe from other rows for the same id/person?

I have this dataframe:

import numpy as np  # needed for np.nan below
import pandas as pd

data = [{'a': 2, 'b': 2, 'c': 3}, {'b': 2, 'c': np.nan}, {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': np.nan, 'c': np.nan}]

df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])

What I am trying to do is fill each user's missing values from that same user's other rows.

My goal dataframe would be:

data = [{'a': 2, 'b': 2, 'c': 3}, {'a': 2, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': 20, 'c': 30}]

df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])

The real dataframe has thousands of rows, but I believe a solution that works on this minimal example should carry over to a big dataframe.

I do not want to use pd.merge, because my original dataframes already have thousands of columns and merging would add thousands more.

CodePudding user response:

You can use groupby().transform('first') to get the first non-null value of each column for each user, repeated on every row of that user, and then pass the result to fillna:

df = df.fillna(df.groupby(level=0).transform('first'))
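As a minimal end-to-end sketch of the one-liner above, using the question's sample data (the names first_valid and filled are only illustrative, not part of the original answer):

```python
import numpy as np
import pandas as pd

data = [
    {'a': 2, 'b': 2, 'c': 3},
    {'b': 2, 'c': np.nan},               # 'a' is absent, so pandas stores NaN
    {'a': 10, 'b': 20, 'c': 30},
    {'a': 10, 'b': np.nan, 'c': np.nan},
]
df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])

# For every row, take the first non-null value of each column within
# that row's group (rows sharing the same index label).
first_valid = df.groupby(level=0).transform('first')

# Fill only the missing cells; values that already exist are left untouched.
filled = df.fillna(first_valid)

print(filled)
```

The filled frame has the same values as the goal dataframe, but the columns are float64 rather than int because the NaN values forced a float dtype when df was built. No extra columns are created at any point, which avoids the pd.merge concern.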

Note: You can replace 'first' with other functions, e.g. 'mean', if you like. Since you are grouping based on the index, you can also apply the function directly instead of using transform: groupby().first().
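The two variants mentioned in the note might look like the sketch below. It rebuilds the sample data so that it runs on its own; the names filled_mean, per_user and filled_direct are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'a': [2, np.nan, 10, 10],
        'b': [2, 2, 20, np.nan],
        'c': [3, np.nan, 30, np.nan],
    },
    index=['John', 'John', 'Mike', 'Mike'],
)

# Variant 1: fill gaps with the per-user mean instead of the first valid value.
filled_mean = df.fillna(df.groupby(level=0).transform('mean'))

# Variant 2: aggregate to one row per user, then let fillna align that
# smaller frame to df by index label.
per_user = df.groupby(level=0).first()
filled_direct = df.fillna(per_user)

print(per_user)
print(filled_direct)
```

On this sample both variants give the same result, because each user has only one distinct non-null value per column. They would differ if a user had conflicting values: 'mean' averages them, while 'first' keeps the earliest one.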