Home > Mobile >  Fill NAN values during Pandas merge
Fill NAN values during Pandas merge

Time:07-27

I have a little problem but I don't find solutions...I want to merge two dataframes and I want to fill in the NaN values using the information from each dataframes... Consider two dataframes :

  • First people :
id skills diploma
1 nan omicron
2 beta nan
5 beta epsilon
6 omega zeta
  • and jobs :
id people_id skills diploma
10 1 alpha nan
20 2 nan zeta
30 3 beta pi
40 4 omega zeta

When I use Pandas merge :

pd.merge(people, jobs, left_on='id', right_on='people_id', how='inner')

and I have this result :

id_x skills_x diploma_x id_y people_id skills_y diploma_y
1 nan omicron 10 1 alpha nan
2 beta nan 20 2 nan zeta

I want to produce via Pandas merge a new dataframe where each line is a job and i want to fill NaN value, keep the job's id (because each line is a job) and keep one time the columns "skills" and "diploma", like this :

id people_id skills diploma
10 1 alpha omicron
20 2 beta zeta

To code to reproduce :

import pandas as pd
import numpy as np

people = pd.DataFrame({'id':[1,2,5,6], 'skills': [np.nan, "beta", "beta", "omega"], 'diploma': ["omicron", np.nan, "epsilon", "zeta"]})
jobs = pd.DataFrame({'id':[10,20,30,40], 'people_id':[1,2,3,4], 'skills': ["alpha", np.nan, "beta", "omega"], 'diploma': [np.nan, "zeta", "pi", "zeta"]})

CodePudding user response:

First merge the dataframe:

merged_df = pd.merge(
    people.rename(columns={"id":"people_id"}), #renaming the id of people table
    jobs,
    on='people_id', how='inner', suffixes=[".people", ".jobs"]
).set_index(["people_id","id"])

#creating the second columns level
merged_df.columns = merged_df.columns.str.split(".", expand=True) 

Now merged_df is:

                skills  diploma skills  diploma
                people  people  jobs    jobs
people_id   id              
1           10  NaN     omicron alpha   NaN
2           20  beta    NaN     NaN     zeta

Then your task can be solved by the following lines:

merged_df["skills"] = merged_df["skills"].apply(lambda x: pd.Series.dropna(x)[0], axis=1)
merged_df["diploma"] = merged_df["diploma"].apply(lambda x: pd.Series.dropna(x)[0], axis=1)

#remove the second level (we don't need it anymore)
merged_df.columns = merged_df.columns.droplevel(1)

#remove duplicate columns
merged_df = merged_df.loc[:,merged_df.columns.duplicated()]

Finally merged_df is:

                skills  diploma
people_id   id
1           10  alpha   omicron
2           20  beta    zeta

If you prefer you can reset_index().

PS: Note that if skills or diploma have NaN value in both dataframes for a person, then the algorithm will rise an error.

CodePudding user response:

The combine_first method is the one you are looking for friend

filled_people = people.combine_first(jobs)

Just be carefull with your indexes and make sure they match, like column A to be filled has the same name in your jobs df.

  • Related