I have two data frames, df_1 and df_2, with different numbers of rows and columns. My df_1 looks like this:
ID Date QN RWS RS WI eor
0 00603 202206200000 2 0 0.0 0 eor
1 00603 202206200010 2 0 0.0 0 eor
2 00603 202206200020 2 0 0.0 0 eor
.. ... ... ... ... ... ... ...
230007 19140 202206202340 2 0 0.0 0 eor
230008 19171 202206202350 2 0 0.0 0 eor
and df_2 looks like this
ID Start_Date End_Date Station Latitude Longitude
0 00020 20040813 20220619 432 48.9219 9.9129
1 00040 20070209 20220620 44 47.9219 8.0129
2 00603 20070209 20220620 45 46.8319 7.5129
.. ... ... ... ... ... ...
950 15813 20220301 20220619 40 48.9319 8.9129
951 19140 20201101 20220619 278 47.5310 9.0124
952 19171 20200901 20220619 13 49.9013 7.0809
I do not have the Latitude, Longitude, Start_Date, and End_Date columns in df_1, so I want to add them based on the ID values, i.e. a row in df_1 whose ID also appears in df_2 should get exactly the same Latitude, Longitude, Start_Date, and End_Date as that ID has in df_2.
How can I achieve this? By using a for loop with row iteration?
I have this logic in mind, but I cannot translate it into code:
if 'ID' in df_1 == 'ID' in df_2:
then add latitude, longitude, start_date, and end_date columns to the df_1 from df_2.
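Roughly, the row-iteration version I have in mind would look something like this (just a sketch of the logic, untested, and probably slow for 230,000 rows):
lookup = df_2.set_index('ID')
for i, row in df_1.iterrows():
    # copy the extra columns from df_2 whenever the ID is found there
    if row['ID'] in lookup.index:
        for col in ['Start_Date', 'End_Date', 'Latitude', 'Longitude']:
            df_1.loc[i, col] = lookup.loc[row['ID'], col]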
Any help will be highly appreciated. Thanks
Answer:
You can use DataFrame.join() like this (df_2 first gets ID set as its index, because join aligns the other frame on its index):
df_1 = df_1.join(df_2.set_index('ID')[['Start_Date','End_Date','Latitude','Longitude']], on='ID')
Test code:
import pandas as pd

df_1 = pd.DataFrame({'ID': ['00603', '00603', '00603'], 'col_from_df1': [2, 2, 2]})
df_2 = pd.DataFrame({'ID': ['00020', '00040', '00603'], 'Start_Date': [20040813, 20070209, 20070209],
                     'End_Date': [20220619, 20220620, 20220620], 'Station': [432, 44, 45],
                     'Latitude': [48.9219, 47.9219, 46.8319], 'Longitude': [9.9129, 8.0129, 7.5129]})
print(df_1)
print(df_2)
df_1 = df_1.join(df_2.set_index('ID')[['Start_Date','End_Date','Latitude','Longitude']], on='ID')
print(df_1)
Input:
df_1:
ID col_from_df1
0 00603 2
1 00603 2
2 00603 2
df_2:
ID Start_Date End_Date Station Latitude Longitude
0 00020 20040813 20220619 432 48.9219 9.9129
1 00040 20070209 20220620 44 47.9219 8.0129
2 00603 20070209 20220620 45 46.8319 7.5129
Output:
ID col_from_df1 Start_Date End_Date Latitude Longitude
0 00603 2 20070209 20220620 46.8319 7.5129
1 00603 2 20070209 20220620 46.8319 7.5129
2 00603 2 20070209 20220620 46.8319 7.5129
Answer:
You can achieve this using the merge/join functionality of pandas.
Here is an example:
import pandas as pd
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "values": [10, 20, 30, 40, 50]
})
df2 = pd.DataFrame({
    "ID": [2, 4],
    "Lat": ["Foo", "Bar"],
    "Long": ["Bar", "Foo"],
    "Extra1": [1, 2],
    "Extra2": [3, 4]
})
new_df1 = pd.merge(
    left=df1,
    right=df2.filter(["ID", "Lat", "Long"]),  # Filter out unnecessary cols prior to merge
    how="left",  # We are adding TO df1 FROM df2
    on="ID"  # What the merge column is
)
print(new_df1)
Result:
ID values Lat Long
0 1 10 NaN NaN
1 2 20 Foo Bar
2 3 30 NaN NaN
3 4 40 Bar Foo
4 5 50 NaN NaN
Notice the cells of df1 are unchanged, while the values from df2 have been transferred over where the IDs match. NaN fills the cells that could not be matched on ID.
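Applied to the frames from the question, the same pattern would look roughly like this (a sketch, assuming the column names shown above; new_df_1 is just an illustrative name):
import pandas as pd

# left-merge the date and location columns from df_2 onto df_1, matching on ID
new_df_1 = pd.merge(
    left=df_1,
    right=df_2.filter(["ID", "Start_Date", "End_Date", "Latitude", "Longitude"]),
    how="left",
    on="ID"
)

# rows whose ID was not found in df_2 end up with NaN in the new columns
print(new_df_1[new_df_1["Latitude"].isna()]["ID"].unique())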