Home > Back-end >  Compare values between DataFrames and merge columns in Python
Compare values between DataFrames and merge columns in Python

Time:09-22

I have two pandas DataFrames, where the first DataFrame has two columns: "a" and "id" and the second DataFrame has two columns: "id" and "color_value". I would like to compare the ids between the two DataFrames and if there is a match, add a column to the first DataFrame with the correct color_value. In essence, I am using the second DataFrame as a lookup table to find the right ID and match it to its corresponding color value.

See below.

df_1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'id': ['001', '002', '001', '004', '003']})
df_2 = pd.DataFrame({'id': ['001', '002', '003', '004'], 'color_value': ['blue', 'red', 'yellow', 'orange']})

so df1:
a  id
1  001
2  002
3  001
4  004
5  003

df2:
id   color_value
001  blue
002  red
003  yellow
004  orange

I need a way to match up the IDs from df1 and df2 and add the color_value column to df1, so the result is:

df1:
a  id   color_value
1  001   blue
2  002   red
3  001   blue
4  004   orange
5  003   yellow

CodePudding user response:

You can form a mapping by setting df_2 with id as index using .set_index() (acts as keys of the mapping) and take the color_value column (acts as values of the mapping).

Then, you can map id in df_1 to the mapping by using .map(), as follows:

df_1['color_value'] =  df_1['id'].map(df_2.set_index('id')['color_value'])

or breaking it down into 2 lines for better readability/ easier understanding, as follows:

mapping = df_2.set_index('id')['color_value']
df_1['color_value'] =  df_1['id'].map(mapping)

Result:

print(df_1)

   a   id color_value
0  1  001        blue
1  2  002         red
2  3  001        blue
3  4  004      orange
4  5  003      yellow

CodePudding user response:

This is exactly what a pandas merge does! There are different types of merges, but the one you're looking for is called a "left" merge. That means that in the final dataframe, all of the rows in your first dataframe will be there, joined with only the rows in your second dataframe that have a matching id.

Imagine that your first dataframe is called the "left" dataframe, and your second one is the "right" dataframe. Your id column acts as the "key," holding values that might match between the two dataframes.

This is how you could use the merge function to accomplish your goal:

merged_df = df_1.merge(df_2, how='left', on='id')

The "how" parameter is where you input what kind of merge you want to do (a left merge). The "on" parameter is where you put the key column name (the id column).

Result:

df_1:
   a   id
0  1  001
1  2  002
2  3  001
3  4  004
4  5  003

df_2:
  color_value   id
0        blue  001
1         red  002
2      yellow  003
3      orange  004

merged_df:
   a   id color_value
0  1  001        blue
1  2  002         red
2  3  001        blue
3  4  004      orange
4  5  003      yellow

For more info:

Docs for the merge function (the examples are helpful): Pandas.DataFrame.merge

And a post with more info about different types of merges: Pandas Merging 101

  • Related