Home > Back-end >  How to fill na values of a column by checking another column
How to fill na values of a column by checking another column

Time:04-04

This image would help better:

This IS a screenshot of the data here

The column titled passengerId describes the group number and person number, people in the same group are usually families, hence they come from the same planet. So they are some nan rows in the Home planet column and I want to fill it through knowledge about the group number in the PassengerID column.

So I need a code or maybe a loop that'll fill na values in the Home planet column by checking if they're in a group with someone (because they would therefore be in the same homeplanet since they are likely a family) . That's just basically what I need help with Filling the na values in the Homeplanet column by checking the group number and using the Homeplanet of about group member as the replacement for the na value

I've tried running for loops but I didn't even know what parameter to specify. I converted the PassengerId into an array and the same with Homeplanet and tried to other through members but I didn't know how to move forward.

CodePudding user response:

If I understand the description correctly, this example data frame would showcase the problem:

df = pd.DataFrame({'passenger_id': ['1', '1', '2', '2'], 'home_planet': ['3', np.nan, '4', np.nan]})
df

   passenger_id | home_planet
0 | 1           |  3
1 | 1           |  NaN
2 | 2           |  4
3 | 2           |  NaN

where you want the NaN values to be 3 and 4 based on the value in passenger_id column.

You can do this with merging the DataFrame with its cleaned and deduplicated self:

pd.merge(df, df.loc[df['home_planet'].notna()].drop_duplicates(), 
on='passenger_id', suffixes=('_x', ''))[['passenger_id', 'home_planet']]


   passenger_id | home_planet
0 | 1           |  3
1 | 1           |  3
2 | 2           |  4
3 | 2           |  4

Update after the question has been updated

You can extract a GroupId field from PassengerId and do what I originally suggested like this:

df = pd.DataFrame({'PassengerId': ['9280_01', '9280_02', '9279_01', '9279_02'], 
'HomePlanet': ['Europa', np.nan, 'Earth', np.nan]})
df

  PassengerId HomePlanet
0     9280_01     Europa
1     9280_02        NaN
2     9279_01      Earth
3     9279_02        NaN

df['GroupId'] = df['PassengerId'].apply(lambda x: x.split('_')[0])
df

  PassengerId HomePlanet GroupId
0     9280_01     Europa    9280
1     9280_02        NaN    9280
2     9279_01      Earth    9279
3     9279_02        NaN    9279

pd.merge(df, df.loc[df['HomePlanet'].notna()].drop_duplicates(), 
on='GroupId', suffixes=('_x', ''))[['PassengerId', 'HomePlanet']]

  PassengerId HomePlanet
0     9280_01     Europa
1     9280_01     Europa
2     9279_01      Earth
3     9279_01      Earth

If you want to do further checks to determine if the two passengers are indeed from the same family (for example check their names) you can do that in the apply.

  • Related