I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially only wanted 2 of the other columns from that transportation file so, after I downloaded the file, I removed all of the other columns except the 2 that I wanted to add to my file and the census tract column.
This is the code I used:
df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
This worked but then I wanted to add the other columns from the transportation file so I used my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file is too big so I tried merging individual columns (other than the 2 I was initially able to merge) and this again results in all of the correct columns but only 4 rows merging.
Any help would be much appreciated.
Edits: Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index = False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transport file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why is it that I am able to merge those initial 2 columns from the transportation file and keep all of the rows from my file but when I try to merge the entire transportation file I only get 4 rows of output?
CodePudding user response:
I recommend specifying merge type and merge column(s).
When you use pd.merge()
, the default merge type is inner merge, and on the same named columns using:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', left_on=[COLUMN], right_on=[COLUMN])
It is possible that one of the columns you removed from the "transportation_data.xlsx"
file previously is the same name as a column in your "my_data.xlsx"
, causing unmatched rows to be removed due to an inner merge.
A 'left'
merge would allow the two columns you need from "transportation_data.xlsx"
to attach to values in your "my_data.xlsx"
, but only where there is a match. This means your merged DataFrame will have the same number of rows as your "my_data.xlsx"
has currently.
CodePudding user response:
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.