Pandas: updating specific values in a dataframe from another dataframe-CodePudding

I am working with a dataframe of around 1000 samples. In a particular column there are approximately 800 samples with valid data and 200 with missing values dispersed irregularly through the population.

I have extracted the samples with missing values and filled with valid data from a different dataset as a csv.

I am trying to update the original dataframe with the corrected values but am finding it more difficult than I believe it should be, but I've been staring at it for long enough that I think I can no longer see the woods for the trees.

To illustrate the problem in code:

import pandas as pd
import numpy as np

df_master = pd.DataFrame({'id':[1, 2, 3, 4, 5, 6],
                          'player_id':[53, 26, 74, 15, 63, 82],
                          'age':[25, np.NaN, 37, 13,53, np.NaN])

ageless = df_master[df_master.age.isna()]
ageless.to_csv('fill_ages.csv')

after filling the ages:

filled_ages = pd.read_csv('filled_ages.csv')

I have tried using a merge:

pd.merge(df_master, filled_ages, on=df_master['player_id'], how='left')

but I get a 'Check Duplicates' error. I've also tried iterating over the dataframes:

for i in df_master.player_id:
    if i in filled_ages.player_id:
        df_master.age[df_master.player_id == i] = filled_ages.age[filled_ages.player_id == i]

but get a 'Copy on a slice' warning. Any ideas what I'm doing wrong?

CodePudding user response：

You can use pd.Series.map like this:

df_master['age'] = df_master['id'].map(filled_ages.set_index('id')['age'])\
                                  .fillna(df_master['age'])

CodePudding user response：

Make sure you read the file's index when reading it:

filled_ages = pd.read_csv('filled_ages.csv', index_col=0)

Then you can just fill in the NaNs using the existing indices (assuming you didn't change them):

df_master= df_master.fillna(filled_ages)

It will update the NaN values with values from the filled_ages dataframe with the same index. This will work as long as you preserved the index.

If you didn't preserve the index, and assuming that player_id is unique, you can use that as the index:

master_df = master_df.set_index('player_id')
filled_ages = filled_ages.set_index('player_id')

master_df["age"] = master_df["age"].fillna(filled_ages["age"])
master_df = master_df.reset_index()