I have a little problem but I don't find solutions...I want to merge two dataframes and I want to fill in the NaN values using the information from each dataframes... Consider two dataframes :
- First people :
id | skills | diploma |
---|---|---|
1 | nan | omicron |
2 | beta | nan |
5 | beta | epsilon |
6 | omega | zeta |
- and jobs :
id | people_id | skills | diploma |
---|---|---|---|
10 | 1 | alpha | nan |
20 | 2 | nan | zeta |
30 | 3 | beta | pi |
40 | 4 | omega | zeta |
When I use Pandas merge :
pd.merge(people, jobs, left_on='id', right_on='people_id', how='inner')
and I have this result :
id_x | skills_x | diploma_x | id_y | people_id | skills_y | diploma_y |
---|---|---|---|---|---|---|
1 | nan | omicron | 10 | 1 | alpha | nan |
2 | beta | nan | 20 | 2 | nan | zeta |
I want to produce via Pandas merge a new dataframe where each line is a job and i want to fill NaN value, keep the job's id (because each line is a job) and keep one time the columns "skills" and "diploma", like this :
id | people_id | skills | diploma |
---|---|---|---|
10 | 1 | alpha | omicron |
20 | 2 | beta | zeta |
To code to reproduce :
import pandas as pd
import numpy as np
people = pd.DataFrame({'id':[1,2,5,6], 'skills': [np.nan, "beta", "beta", "omega"], 'diploma': ["omicron", np.nan, "epsilon", "zeta"]})
jobs = pd.DataFrame({'id':[10,20,30,40], 'people_id':[1,2,3,4], 'skills': ["alpha", np.nan, "beta", "omega"], 'diploma': [np.nan, "zeta", "pi", "zeta"]})
CodePudding user response:
First merge the dataframe:
merged_df = pd.merge(
people.rename(columns={"id":"people_id"}), #renaming the id of people table
jobs,
on='people_id', how='inner', suffixes=[".people", ".jobs"]
).set_index(["people_id","id"])
#creating the second columns level
merged_df.columns = merged_df.columns.str.split(".", expand=True)
Now merged_df
is:
skills diploma skills diploma
people people jobs jobs
people_id id
1 10 NaN omicron alpha NaN
2 20 beta NaN NaN zeta
Then your task can be solved by the following lines:
merged_df["skills"] = merged_df["skills"].apply(lambda x: pd.Series.dropna(x)[0], axis=1)
merged_df["diploma"] = merged_df["diploma"].apply(lambda x: pd.Series.dropna(x)[0], axis=1)
#remove the second level (we don't need it anymore)
merged_df.columns = merged_df.columns.droplevel(1)
#remove duplicate columns
merged_df = merged_df.loc[:,merged_df.columns.duplicated()]
Finally merged_df
is:
skills diploma
people_id id
1 10 alpha omicron
2 20 beta zeta
If you prefer you can reset_index()
.
PS: Note that if skills
or diploma
have NaN
value in both dataframes for a person, then the algorithm will rise an error.
CodePudding user response:
The combine_first method is the one you are looking for friend
filled_people = people.combine_first(jobs)
Just be carefull with your indexes and make sure they match, like column A to be filled has the same name in your jobs df.