Two seperate similar DataFrames with different lengths
df2
=
Index | First Name | Age | Gender | Weight |
---|---|---|---|---|
0 | James | 25 | Male | 155 |
1 | John | 27 | Male | 175 |
2 | Patricia | 23 | Female | 135 |
3 | Mary | 22 | Female | 125 |
4 | Martin | 30 | Male | 185 |
5 | Margaret | 29 | Female | 141 |
6 | Kevin | 22 | Male | 198 |
df1
=
Index | First Name | Age | Gender | Weight | Height |
---|---|---|---|---|---|
0 | James | 25 | Male | 165 | 5'10 |
1 | John | 27 | Male | 175 | 5'9 |
2 | Matthew | 29 | Male | 183 | 6'0 |
3 | Patricia | 23 | Female | 135 | 5'3 |
4 | Mary | 22 | Female | 125 | 5'4 |
5 | Rachel | 29 | Male | 123 | 5'3 |
6 | Jose | 20 | Male | 175 | 5'11 |
7 | Kevin | 22 | Male | 192 | 6'2 |
df2
has some rows which are not in df1
and df1
has some values which are not in df2
.
I am comparing df1
against df2
. I have calculated the newentries
with the following code
newentries = df2.loc[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1.loc[~df1['First Name'].isin(df2['First Name'])]
where newentries
denote the rows/entries that are there in df2
but not in df1
; deletedentries
denote the rows/entries that are there in df1
but not in df2
. The above code works perfectly fine.
I need to copy the height from df1
to df2
when the first names are equal.
df2.loc[df2['First Name'].isin(df1['First Name']),"Height"] = df1.loc[dfxh2['First Name'].isin(df2['First Name']),"Height"]
The above code copies the values however indexing is causing an issue and the values are not copied to the corresponding rows, I tried to promote First Name
as the Index but that doesn't solve the issue. Please help me with a solution
Also, I need to calculate the modified values, if the First Name
is same, I need to check for modified values; for example in df1
, the weight of James
is 155 however in df2
the weight is 165, so I need to store the modified weight of James(165) and index(0) in a new dataframe without iteration; the iteration takes a long time because this is a sample of a big dataframe with a lot more rows and columns.
Desired output:
df2
=
Index | First Name | Age | Gender | Weight | Height |
---|---|---|---|---|---|
0 | James | 25 | Male | 155 | 5'10 |
1 | John | 27 | Male | 175 | 5'9 |
2 | Patricia | 23 | Female | 135 | 5'3 |
3 | Mary | 22 | Female | 125 | 5'4 |
4 | Martin | 30 | Male | 185 | |
5 | Margaret | 29 | Female | 141 | |
6 | Kevin | 22 | Male | 198 | 6'2 |
Martin
's and Margaret
's heights are not there in df1
, so their heights are not updated in df2
newentries
=
Index | First Name | Age | Gender | Weight | Height |
---|---|---|---|---|---|
4 | Martin | 30 | Male | 185 | |
5 | Margaret | 29 | Female | 141 |
deletedentries
=
Index | First Name | Age | Gender | Weight | Height |
---|---|---|---|---|---|
2 | Matthew | 29 | Male | 183 | 6'0 |
5 | Rachel | 29 | Male | 123 | 5'3 |
6 | Jose | 20 | Male | 175 | 5'11 |
modval
=
Index | First Name | Age | Gender | Weight | Height |
---|---|---|---|---|---|
0 | James | 165 | |||
7 | Kevin | 192 |
CodePudding user response:
for your desired output for df2
you can try this:
desired_df2 = df2.merge(df1[['First Name','Height']], on='First Name', how='left')
#if you want to change the "NaN" values just add ".fillna(fill_value=0)" for e.g 0 after the merge
print(desired_df2)
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
new and deleted entries is already right. for the moment I'm a bit stuck how to get the modval dataframe. I'll update my answer if I get a solution.
CodePudding user response:
Building off of Rabinzel's answer:
output = df2.merge(df1, how='left', on='First Name', suffixes=[None, '_old'])
df3 = output[['First Name', 'Age', 'Gender', 'Weight', 'Height']]
cols = df1.columns[1:-1]
modval = pd.DataFrame()
for col in cols:
modval = pd.concat([modval, output[['First Name', col '_old']][output[col] != output[col '_old']].dropna()])
modval.rename(columns={col '_old':col}, inplace=True)
newentries = df2[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1[~df1['First Name'].isin(df2['First Name'])]
print(df3, newentries, deletedentries, modval, sep='\n\n')
Output:
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
First Name Age Gender Weight
4 Martin 30 Male 185
5 Margaret 29 Female 141
First Name Age Gender Weight Height
2 Matthew 29 Male 183 6'0
5 Rachel 29 Male 123 5'3
6 Jose 20 Male 175 5'11
First Name Age Gender Weight
0 James NaN NaN 165.0
6 Kevin NaN NaN 192.0