I am working with a dataframe of around 1000 samples. In a particular column there are approximately 800 samples with valid data and 200 with missing values dispersed irregularly through the population.
I have extracted the samples with missing values and filled with valid data from a different dataset as a csv.
I am trying to update the original dataframe with the corrected values but am finding it more difficult than I believe it should be, but I've been staring at it for long enough that I think I can no longer see the woods for the trees.
To illustrate the problem in code:
import pandas as pd
import numpy as np
df_master = pd.DataFrame({'id':[1, 2, 3, 4, 5, 6],
'player_id':[53, 26, 74, 15, 63, 82],
'age':[25, np.NaN, 37, 13,53, np.NaN])
ageless = df_master[df_master.age.isna()]
ageless.to_csv('fill_ages.csv')
after filling the ages:
filled_ages = pd.read_csv('filled_ages.csv')
I have tried using a merge:
pd.merge(df_master, filled_ages, on=df_master['player_id'], how='left')
but I get a 'Check Duplicates' error. I've also tried iterating over the dataframes:
for i in df_master.player_id:
if i in filled_ages.player_id:
df_master.age[df_master.player_id == i] = filled_ages.age[filled_ages.player_id == i]
but get a 'Copy on a slice' warning. Any ideas what I'm doing wrong?
CodePudding user response:
You can use pd.Series.map like this:
df_master['age'] = df_master['id'].map(filled_ages.set_index('id')['age'])\
.fillna(df_master['age'])
CodePudding user response:
Make sure you read the file's index when reading it:
filled_ages = pd.read_csv('filled_ages.csv', index_col=0)
Then you can just fill in the NaNs using the existing indices (assuming you didn't change them):
df_master= df_master.fillna(filled_ages)
It will update the NaN values with values from the filled_ages
dataframe with the same index. This will work as long as you preserved the index.
If you didn't preserve the index, and assuming that player_id
is unique, you can use that as the index:
master_df = master_df.set_index('player_id')
filled_ages = filled_ages.set_index('player_id')
master_df["age"] = master_df["age"].fillna(filled_ages["age"])
master_df = master_df.reset_index()