I am currently facing an issue and try to find a solution to it.
I have 2 files that I am trying to merge.
File 1: Contains all the data 15,000 rows with 46 columns. File 2: is my reference file contains 130 rows and 2 columns. (2 columns: NAME and TYPE_OF_CAR)
What I am trying to do is I have "NAME" in file 1 and file 2. I want to add "TYPE_OF_CAR" from file 2 to file 1 based on the "NAME"
File 1:
NAME ...
Car 1 ...
Car 2 ...
...
Car 50 ...
File 2:
NAME TYPE_OF_CAR
Car 1 Sport
Car 2 Sport
Car 50 Regular
In File 1 I would like to have the exact match but if I test with only 2000 rows it is correct and I got the proper match for the TYPE_OF_CAR. But when I test with the all the rows on File 1 for the same value I got as follow:
NAME TYPE_OF_CAR
Car 1 Sport
Car 1 Regular
Car 1
Car 2 Regular
Car 2 Sport
I got 3 different values for the same name Sport, Regular and empty value.
Here is the code:
df1 = pd.read_excel(source_path "cars.xlsx", sheet_name='data')
df2 = pd.read_excel(source_path "Type_of_Car.xlsx", sheet_name='data')
merge_df = df2.merge(df1, on='NAME', how='right')
df1['TYPE_OF_CAR'] = merge_df['TYPE_OF_CAR']
I have no idea what is the problem I tried so many solutions and try with less data it is working but with all data it does not work. Let me know if you have any idea.
CodePudding user response:
Should we use how = left
like the following?
merge_df = df2.merge(df1, on='NAME', how='left')
CodePudding user response:
Thank you so much for your comment Ragas, unfortunately I still got wrong matches for the same value in the name. I still got these type of values: NAME TYPE_OF_CAR Car 1 Sport Car 1 Sport Car 1 Sport Car 1 Car 1 Regular Car 2 Sport Car 2 Regular Car 2 Car 2 Sport Car 2 Sport ... Car 50 Regular Car 50 Regular Car 50 Regular Car 50 Car 50 Sport
I got 3 different values for the same Car Name, this does not make sense I should have all the line for Car 1 = Sport all the lines for Car 2 = Sport and all the lines for Car 50 = Regular.
Also in the file 2 I have sometimes empty values because the Car is not associated to a Type Car. So in the merge file I should have empty for the TYPE_OF_CAR values.
I tried everything left, right, inner, outer but nothing work. I change the source file to see if there was an issue with the source file but nothing changed too.
Questions:
Maybe I am doing something wrong? Maybe the merge does not work properly with a file with 15,000 lines? Files that I am using are .xlsx format, should I use .csv format? Maybe the merge is not the right approach for my case as I have multiple lines with the same name and I should get the same TYPE_OF_CAR values for the matching NAME of the car? I am trying to understand where is the issue coming from.
Thank you if you can answer to my problem.
I also tried with np.where instead of merge as below:
df1['TYPE_OF_CAR'] = np.where(df1['NAME'].equals(df2['NAME']),df2['TYPE_OF_CAR'],"NA")
Then I got the following error: ValueError: Length of values (205) does not match length of index (15300)