How to fill NaNs by making groups?

I have a dataframe like this:

    ID  Age Gender  Date
0   1   18  Male    NaN
1   1   18  Male    NaN
2   1   18  Male    2016-03-11
3   2   22  Male    NaN
4   2   22  Male    NaN
5   4   25  Male    NaN
6   4   25  Male    NaN
7   4   25  Male    NaN
8   4   25  Male    2017-04-27

There are some NaN values in the Date column, and I want to fill them using the Date of the rows that share the same ID. For example:

ID = 1 occurs 3 times in the dataframe, and the Date is given only once (2016-03-11). I want to fill the remaining two NaNs with that same date.
The same goes for ID = 4: it occurs 4 times and the Date is shown for only one row. In short, I want the resulting dataframe to look like this:

    ID  Age Gender  Date
0   1   18  Male    2016-03-11
1   1   18  Male    2016-03-11
2   1   18  Male    2016-03-11
3   2   22  Male    NaN
4   2   22  Male    NaN
5   4   25  Male    2017-04-27
6   4   25  Male    2017-04-27
7   4   25  Male    2017-04-27
8   4   25  Male    2017-04-27

I tried the groupby method, but I am not getting any results. Can you please tell me how to do that in Python?

Here is a CSV file, if you want to try it on your local machine:

ID, Age, Gender, Date
1, 18, Male,
1, 18, Male,
1, 18, Male, 2016-03-11
2, 22, Male,
2, 22, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male, 2017-04-27
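
For anyone reproducing this, here is a minimal sketch of loading the CSV above into a dataframe. It assumes the text is pasted into a string (the name csv_text is only for illustration); skipinitialspace=True is there because every comma in the file is followed by a space, which would otherwise end up in the column names and values.

```python
import io

import pandas as pd

# The CSV from the question, pasted as a string (csv_text is an illustrative name).
csv_text = """ID, Age, Gender, Date
1, 18, Male,
1, 18, Male,
1, 18, Male, 2016-03-11
2, 22, Male,
2, 22, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male, 2017-04-27
"""

# skipinitialspace=True drops the blank after each comma, so the columns are
# named 'Age', 'Gender', 'Date' rather than ' Age', ' Gender', ' Date'.
# Empty Date fields are read as NaN.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df)
```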

CodePudding user response：

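You can use groupby and then apply to backward-fill and forward-fill within each ID group. I am not sure this gives the desired result if an ID has more than one date: in that case each NaN takes the next date below it in its group, and any NaNs after the last date take that last date.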

df.groupby('ID').apply(lambda group: group.bfill().ffill())

Output:

   ID  Age Gender        Date
0   1   18   Male  2016-03-11
1   1   18   Male  2016-03-11
2   1   18   Male  2016-03-11
3   2   22   Male         NaN
4   2   22   Male         NaN
5   4   25   Male  2017-04-27
6   4   25   Male  2017-04-27
7   4   25   Male  2017-04-27
8   4   25   Male  2017-04-27
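
A possible variant of the same idea, sketched here as an assumption rather than as part of the answer above: fill only the Date column with transform instead of applying to the whole group. transform returns a result aligned to the original index, which avoids depending on the exact shape of the apply result (its index, and whether ID stays a column, can vary across pandas versions). The dataframe is rebuilt inline so the snippet runs on its own; the name filled is illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question, rebuilt inline so the snippet is self-contained.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 4, 4, 4, 4],
    "Age": [18, 18, 18, 22, 22, 25, 25, 25, 25],
    "Gender": ["Male"] * 9,
    "Date": [np.nan, np.nan, "2016-03-11", np.nan, np.nan,
             np.nan, np.nan, np.nan, "2017-04-27"],
})

filled = df.copy()
# Back-fill, then forward-fill, the Date column within each ID group.
# Groups with no date at all (ID = 2) stay NaN.
filled["Date"] = df.groupby("ID")["Date"].transform(lambda s: s.bfill().ffill())
print(filled)
```

The other columns are left untouched, and the row order and index are the same as in df.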

CodePudding user response：

If you use

non_nan = df[['ID', 'Date']].dropna().drop_duplicates()

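then non_nan holds the non-NaN dates, one row per distinct (ID, Date) pair.

Now you can merge it back. Use how='left' so that IDs without any date (ID = 2 here) are kept with NaN instead of being dropped:

pd.merge(df[['ID', 'Age', 'Gender']], non_nan, on='ID', how='left')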

Note that if an ID has multiple distinct non-NaN dates, each of its rows is repeated once per date in the merged result (your question didn't specify what to do in this case).
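
If the row duplication is not wanted, one option is to pick a single date per ID and fill only the missing values with it. The sketch below assumes a policy the question does not state (use the first date seen for each ID) and uses made-up sample data in which ID = 1 has two different dates; the names first_date and result are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ID 1 has two distinct dates, ID 2 has none.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Date": [np.nan, "2016-03-11", "2016-05-01", np.nan, np.nan],
})

# Assumed policy: take the first non-NaN date that appears for each ID.
first_date = (
    df.dropna(subset=["Date"])
      .drop_duplicates(subset="ID", keep="first")
      .set_index("ID")["Date"]
)

result = df.copy()
# Map each row's ID to its chosen date and use it only where Date is missing,
# so existing dates are kept and the number of rows does not change.
result["Date"] = df["Date"].fillna(df["ID"].map(first_date))
print(result)
```

A different policy (latest date, most frequent date, and so on) would only change how first_date is built.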