Home > Blockchain >  Compare two dataframes columns and find string similiraties
Compare two dataframes columns and find string similiraties

Time:10-30

Im in a project using pyhton where i have two dataframes

Product Name Cost
Car with batteries 2
Headphones Sony 3
Product Name Cost
Car batteries 2
Headphones Sony 3

What i have to do is compare this dataframes and put a column that said 'Matches' But like you see it have to compare Car with batteries and Car batteries and put matches because they are the same product. It is possible?.

CodePudding user response:

Assuming the first dataframe is df1 and the second one is df2, you can:

  • Create a new column in the dataframe of your choice (either df1, df2 or even a new one)
  • Call this new column "Matched"
  • Populate the new column values with the results from the condition: populate with True if the condition is true, otherwise populate with False:
    df1["Matches"] = df1["Cost"] ==  df1["Cost"] 

This will create a new column of True/False values based on whether the battery costs of the same rows in the 2 dataframes match or not (assumes that df1 and df2 have the same number of rows).

CodePudding user response:

You could use fuzzywuzzy to find the similarity of the values. Using for example fuzz.ratio(str1, str2) will return you an indicator how similar the strings are. There are other methods as well. You would have to find out yourself what suits best for your use case.


The example below uses fuzz.ratio(str1, str2) and considers a ratio of 80 to be equals:

# pip install fuzzywuzzy python-levenshtein
# or: conda install -c conda-forge fuzzywuzzy python-levenshtein

import io
import pandas as pd

from fuzzywuzzy import fuzz

df1 = pd.read_csv(io.StringIO("""
Product Name, Cost
Car with batteries, 2
Headphones Sony, 3
"""))

df2 = pd.read_csv(io.StringIO("""
Product Name, Cost
Car batteries, 2
Headphones Sony, 3
"""))
COLUMN_NAME = "Product Name"
ACCEPTED_RATIO = 80

def match(right, left):
    return fuzz.ratio(right, left) > ACCEPTED_RATIO

rsuffix = "_r"
compared = df1.join(df2, rsuffix=rsuffix)
compared["Matches"] = compared.apply(
    lambda x: match(x[COLUMN_NAME], x[f"{COLUMN_NAME}{rsuffix}"]),
    axis=1,
)
compared = compared.drop(
    [c for c in compared.columns if c.endswith(rsuffix)],
    axis=1
)

And the output of print(compared) would be:

          Product Name  Cost    Matches
0   Car with batteries     2       True
1      Headphones Sony     3       True
  • Related