Home > Software engineering >  I need a way to compare two strings in python without using sets in a pandas dataframe
I need a way to compare two strings in python without using sets in a pandas dataframe

Time:11-19

I am currently working on a huge csv file with pandas, and I need to find and print similarity between the selected row and every other row. For example if the string is "Card" and the second string is "Credit Card Debit Card" it should return 2 or if the first string is "Credit Card" and the second string is "Credit Card Debit Card" it should return 3 because 3 of the words match with the first string. I tried solving this using sets but because of sets being unique and not containing duplicates in the first example instead of 2 it returns 1. Because in a set "Credit Card Debit Card" is {"Credit", "Card", "Debit"}. Is there anyway that I can calculate this? The formula of similarity is ((numberOfSameWords)/whichStringisLonger)*100 as explained in this photo:

I tried so many things like Jaccard Similarity but they all work with sets and return wrong answers. Thanks for any help. The code I tried running:

def test(row1, row2):
    return str(round(len(np.intersect1d(row1.split(), row2.split())) / max(len(row1.split()), len(row2.split()))*100, 2))

data = int(input("Which index should be tested:"))
for j in range(0,10):
    print(test(dff['Product'].iloc[data], dff['Product'].iloc[j]))

and my dataframe currently looks like this:

print(df.sample(10).to_dict("list")) returned me:

{'Product': ['Bank account or service', 'Credit card', 'Credit reporting', 'Credit reporting credit repair services or other personal consumer reports', 'Credit reporting', 'Mortgage', 'Debt collection', 'Mortgage', 'Mortgage', 'Credit reporting'], 'Issue': ['Deposits and withdrawals', 'Billing disputes', 'Incorrect information on credit report', "Problem with a credit reporting company's investigation into an existing problem", 'Incorrect information on credit report', 'Applying for a mortgage or refinancing an existing mortgage', 'Disclosure verification of debt', 'Loan servicing payments escrow account', 'Loan servicing payments escrow account', 'Incorrect information on credit report'], 'Company': ['CITIBANK NA', 'FIRST NATIONAL BANK OF OMAHA', 'EQUIFAX INC', 'Experian Information Solutions Inc', 'Experian Information Solutions Inc', 'BANK OF AMERICA NATIONAL ASSOCIATION', 'AllianceOne Recievables Management', 'SELECT PORTFOLIO SERVICING INC', 'OCWEN LOAN SERVICING LLC', 'Experian Information Solutions Inc'], 'State': ['CA', 'WA', 'FL', 'UT', 'MI', 'CA', 'WA', 'IL', 'TX', 'CA'], 'ZIP_code': ['92606', '98272', '329XX', '84321', '486XX', '94537', '984XX', '60473', '76247', '91401'], 'Complaint_ID': [90452, 2334443, 1347696, 2914771, 1788024, 2871939, 1236424, 1619712, 2421373, 1803691]}

CodePudding user response:

You can use numpy.intersect1d to get the common words but the % is different for the third row.

import numpy as np

df["Similarity_%"] = (
                        df.apply(lambda x: "%"   str(round(len(np.intersect1d(x['Col1'].split(), x['Col2'].split()))
                                                          / max(len(x["Col1"].split()), len(x["Col2"].split()))
                                                          *100, 2)), axis=1)
                     )

# Output :

print(df)
                                                                Col1                                  Col2 Similarity_%
0                                                    Debt collection                       Debt collection       0.0
1                                                    Debt collection                              Mortgage         %0.0
2                                                Managing loan lease               Problems end loan lease        P.0
3                                                Managing loan lease                   Struggling pay loan       3.33
4  Credit reporting credit repair services personal consumer reports  Payday loan title loan personal loan        .5

# Input used:

import pandas as pd

df= pd.DataFrame({'Col1': ['Debt collection', 'Debt collection', 'Managing loan lease', 'Managing loan lease', 
                           'Credit reporting credit repair services personal consumer reports'],
                  'Col2': ['Debt collection', 'Mortgage', 'Problems end loan lease', 'Struggling pay loan',
                           'Payday loan title loan personal loan']})

CodePudding user response:

You can try this:

import pandas as pd

l1 = ["Debt collection", "Debt collection", "Managing loan lease", "Managing loan lease",
      "Credit reporting credit repair services personal consumer reports"]
l2 = ["Debt collection", "Mortgage", "Problems end loan lease", "Struggling pay loan",
      "Payday loan title loan personal loan"]

df = pd.DataFrame(l1, columns=["col1"])
df["col2"] = l2


def similarity(row1, row2):
    # calculate longest row
    longestSentence = 0
    commonWords = 0
    wordsRow1 = row1.split()
    wordsRow2 = row2.split()
    # calculate similar words in both sentences
    common = list(set(wordsRow1).intersection(wordsRow2))
    if len(wordsRow1) > len(wordsRow2):
        longestSentence = len(wordsRow1)
        commonWords = calculate(common, wordsRow1)
    else:
        longestSentence = len(wordsRow2)
        commonWords = calculate(common, wordsRow2)
    return (commonWords / longestSentence) * 100

def calculate(common, longestRow):
    sum = 0
    for word in common:
        sum  = longestRow.count(word)
    return sum

df['similarity'] = df.apply(lambda x: similarity(x.col1, x.col2), axis=1)

print(df)
  • Related