How to perform sequence matcher on dataframe values in a row in Python?-CodePudding

New to Python, so kind of figuring things out. I have a dataframe from an excel spreadsheet.

Something like this:

	MANUFACTURER	MANUFACTURER PART NUMBER 1
0	Bourns	3266W-1-100
1	Bourns	CAT16-680J4LF
2	Bourns	CR0603AFX-1000ELF
3	Bourns	CR0603-FX-2701ELF
4	Bourns	CR0603-FX-30R0ELF
5	CTS Corporation	742C083103JP
6	KOA Speer Electronics	RK73H2ATTD1003F
7	KOA Speer Electronics	RK73H2ATTD3322F

My current task is to use sequence match for the values in MANUFACTURER PART NUMBER 1, 3266W-1-100 to CAT16-680J4LF, CAT16-680J4LF to CR0603AFX-1000ELF. Then add a column to the dataframe containing the sequence matcher ratio

	MANUFACTURER	MANUFACTURER PART NUMBER 1	Similarity Score
0	Bourns	3266W-1-100	0.25
1	Bourns	CAT16-680J4LF	0.26666666666666666
2	Bourns	CR0603AFX-1000ELF	0.7647058823529411
3	Bourns	CR0603-FX-2701ELF	0.8235294117647058
4	Bourns	CR0603-FX-30R0ELF	0.27586206896551724
5	CTS Corporation	742C083103JP	0.37037037037037035
6	KOA Speer Electronics	RK73H2ATTD1003F	0.8
7	KOA Speer Electronics	RK73H2ATTD3322F

Is that possible? Or should I first make the values as strings first?

My overarching goal is to extract the most occuring series of characters for these codes, and therefore the series family for the codes concerned.

For example: RK73H2ATTD1003F and RK73H2ATTD3322F

Since they are both KOA Speer Electronics, and they just differ in the 'xxxx' characters, the possible series of these is RK73H2ATTDxxxxF

My workflow is

Import data from excel
Sequence match the column MANUFACTURER PART NUMBER 1 values with the adjacent value (value of MANUFACTURER PART NUMBER 1 in the row below it)
As long as the ratio from the sequence match is near to each other, they are possible part of a series family. Add these values as additional column in dataframe.

Final step (another discussion):

Extract the most occurring string of characters from codes with high similarity scores, and then append it as another column.

Currently using this to manually check scores:

import difflib

string1 = "RK73H2ATTD1003F"
string2 = "742C083103JP"

temp = difflib.SequenceMatcher(None,string1 ,string2)

print('Similarity Score: ',temp.ratio())

Sorry for the long post!

CodePudding user response：

Mentioned code snippet should solve your purpose :). Do optimize the logic if required

INPUT df:

df

  idx           MANUFACTURER MANUFACTURER PART NUMBER 1
0   0                 Bourns                3266W-1-100
1   1                 Bourns              CAT16-680J4LF
2   2                 Bourns          CR0603AFX-1000ELF
3   3                 Bourns          CR0603-FX-2701ELF
4   4                 Bourns          CR0603-FX-30R0ELF
5   5        CTS Corporation               742C083103JP
6   6  KOA Speer Electronics            RK73H2ATTD1003F
7

7 KOA Speer Electronics RK73H2ATTD3322F

rows, cols = df.shape
col_idx = 2 # Col Index for "MANUFACTURER PART NUMBER 1", please change this if you change the column name or position

for row in range(rows-1):
    string1 = df.iloc[row, col_idx]
    string2 = df.iloc[row   1, col_idx]
    if string2 != '':
        temp = difflib.SequenceMatcher(None, string1 ,string2)
        # print(string1, " :: ", string2, " :: ", temp.ratio())
        df.loc[row, 'Similarity Score'] = temp.ratio()

df

OUTPUT:

  idx           MANUFACTURER MANUFACTURER PART NUMBER 1 Similarity Score
0   0                 Bourns                3266W-1-100             0.25
1   1                 Bourns              CAT16-680J4LF              0.4
2   2                 Bourns          CR0603AFX-1000ELF         0.764706
3   3                 Bourns          CR0603-FX-2701ELF         0.823529
4   4                 Bourns          CR0603-FX-30R0ELF         0.275862
5   5        CTS Corporation               742C083103JP          0.37037
6   6  KOA Speer Electronics            RK73H2ATTD1003F              0.8
7   7  KOA Speer Electronics            RK73H2ATTD3322F