This image would help better:
The column titled passengerId describes the group number and person number, people in the same group are usually families, hence they come from the same planet. So they are some nan rows in the Home planet column and I want to fill it through knowledge about the group number in the PassengerID column.
So I need a code or maybe a loop that'll fill na values in the Home planet column by checking if they're in a group with someone (because they would therefore be in the same homeplanet since they are likely a family) . That's just basically what I need help with Filling the na values in the Homeplanet column by checking the group number and using the Homeplanet of about group member as the replacement for the na value
I've tried running for loops but I didn't even know what parameter to specify. I converted the PassengerId into an array and the same with Homeplanet and tried to other through members but I didn't know how to move forward.
CodePudding user response:
If I understand the description correctly, this example data frame would showcase the problem:
df = pd.DataFrame({'passenger_id': ['1', '1', '2', '2'], 'home_planet': ['3', np.nan, '4', np.nan]})
df
passenger_id | home_planet
0 | 1 | 3
1 | 1 | NaN
2 | 2 | 4
3 | 2 | NaN
where you want the NaN
values to be 3 and 4 based on the value in passenger_id
column.
You can do this with merging the DataFrame with its cleaned and deduplicated self:
pd.merge(df, df.loc[df['home_planet'].notna()].drop_duplicates(),
on='passenger_id', suffixes=('_x', ''))[['passenger_id', 'home_planet']]
passenger_id | home_planet
0 | 1 | 3
1 | 1 | 3
2 | 2 | 4
3 | 2 | 4
Update after the question has been updated
You can extract a GroupId
field from PassengerId
and do what I originally suggested like this:
df = pd.DataFrame({'PassengerId': ['9280_01', '9280_02', '9279_01', '9279_02'],
'HomePlanet': ['Europa', np.nan, 'Earth', np.nan]})
df
PassengerId HomePlanet
0 9280_01 Europa
1 9280_02 NaN
2 9279_01 Earth
3 9279_02 NaN
df['GroupId'] = df['PassengerId'].apply(lambda x: x.split('_')[0])
df
PassengerId HomePlanet GroupId
0 9280_01 Europa 9280
1 9280_02 NaN 9280
2 9279_01 Earth 9279
3 9279_02 NaN 9279
pd.merge(df, df.loc[df['HomePlanet'].notna()].drop_duplicates(),
on='GroupId', suffixes=('_x', ''))[['PassengerId', 'HomePlanet']]
PassengerId HomePlanet
0 9280_01 Europa
1 9280_01 Europa
2 9279_01 Earth
3 9279_01 Earth
If you want to do further checks to determine if the two passengers are indeed from the same family (for example check their names) you can do that in the apply
.