I have a dataframe like this:
ID Age Gender Date
0 1 18 Male NaN
1 1 18 Male NaN
2 1 18 Male 2016-03-11
3 2 22 Male NaN
4 2 22 Male NaN
5 4 25 Male NaN
6 4 25 Male NaN
7 4 25 Male NaN
8 4 25 Male 2017-04-27
There are some NaN values in the Date column. I want to fill those NaNs based on their respective ID. For example, ID = 1 occurs 3 times in the dataframe, and the Date is given only once (2016-03-11). I want to fill the remaining two NaNs with that same date.
The same goes for ID = 4: it occurs 4 times and the Date is shown for only one row. In short, I want the resulting dataframe to look like this:
ID Age Gender Date
0 1 18 Male 2016-03-11
1 1 18 Male 2016-03-11
2 1 18 Male 2016-03-11
3 2 22 Male NaN
4 2 22 Male NaN
5 4 25 Male 2017-04-27
6 4 25 Male 2017-04-27
7 4 25 Male 2017-04-27
8 4 25 Male 2017-04-27
I tried the groupby method, but I am not getting any results. Can you please tell me how to do that in Python?
Here is a CSV file, if you want to try it on your local machine:
ID, Age, Gender, Date
1, 18, Male,
1, 18, Male,
1, 18, Male, 2016-03-11
2, 22, Male,
2, 22, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male, 2017-04-27
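For anyone reproducing this, here is a minimal sketch of loading the CSV above into a dataframe. It reads the text from an in-memory string rather than a file (the variable name csv_text is just for illustration), and it assumes skipinitialspace=True is wanted so that the space after each comma does not end up in the column names or values.

```python
import io

import pandas as pd

# The sample data from the question, kept in a string for easy reproduction
csv_text = """ID, Age, Gender, Date
1, 18, Male,
1, 18, Male,
1, 18, Male, 2016-03-11
2, 22, Male,
2, 22, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male,
4, 25, Male, 2017-04-27
"""

# skipinitialspace=True strips the blank that follows each comma;
# empty trailing fields are read as NaN
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df)
```

The Date column is left as plain strings here; converting it with pd.to_datetime is optional and not needed for the fill itself.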
CodePudding user response:
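You can use groupby and then apply to back-fill and forward-fill within each ID group. I'm not sure this gives the result you want if an ID has more than one date. Passing group_keys=False keeps the original row index on recent pandas versions.
df.groupby('ID', group_keys=False).apply(lambda group: group.bfill().ffill())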
Output:
ID Age Gender Date
0 1 18 Male 2016-03-11
1 1 18 Male 2016-03-11
2 1 18 Male 2016-03-11
3 2 22 Male NaN
4 2 22 Male NaN
5 4 25 Male 2017-04-27
6 4 25 Male 2017-04-27
7 4 25 Male 2017-04-27
8 4 25 Male 2017-04-27
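As a possible variant of the same idea, the sketch below touches only the Date column and uses transform('first') in place of bfill/ffill. This is a swapped-in technique, not the code from the answer above: 'first' picks the first non-null Date in each ID group and transform broadcasts it to every row of that group. The dataframe is rebuilt by hand here so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt by hand
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 4, 4, 4, 4],
    "Age": [18, 18, 18, 22, 22, 25, 25, 25, 25],
    "Gender": ["Male"] * 9,
    "Date": [np.nan, np.nan, "2016-03-11", np.nan, np.nan,
             np.nan, np.nan, np.nan, "2017-04-27"],
})

# For each ID, take the first non-null Date and assign it to all rows
# of that ID; groups with no date at all stay missing
df["Date"] = df.groupby("ID")["Date"].transform("first")
print(df)
```

Note the difference when an ID has several distinct dates: this variant gives every row of that ID the first date found, whereas bfill followed by ffill fills each gap from the nearest later date.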
CodePudding user response:
If you use
non_nan = df[['ID', 'Date']].dropna().drop_duplicates()
then you have the non-NaN dates for each ID.
Now you can merge them back, using a left join so that IDs with no date at all (such as ID 2) are kept:
pd.merge(df[['ID', 'Age', 'Gender']], non_nan, on='ID', how='left')
Note that if you have multiple distinct non-NaN dates per group, you will get a row for each (your question didn't specify what to do in this case).
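Putting the merge approach together, a self-contained sketch might look like the following. It assumes a left join on ID, so that rows whose ID never has a date survive the merge; the name filled is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt by hand
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 4, 4, 4, 4],
    "Age": [18, 18, 18, 22, 22, 25, 25, 25, 25],
    "Gender": ["Male"] * 9,
    "Date": [np.nan, np.nan, "2016-03-11", np.nan, np.nan,
             np.nan, np.nan, np.nan, "2017-04-27"],
})

# One row per (ID, Date) pair that actually has a date
non_nan = df[["ID", "Date"]].dropna().drop_duplicates()

# Left join: every original row is kept, and IDs without a date get NaN.
# An inner join (the default) would drop the ID 2 rows entirely.
filled = pd.merge(df[["ID", "Age", "Gender"]], non_nan, on="ID", how="left")
print(filled)
```

With this sample data each ID has at most one date, so the result has the same nine rows as the input.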