New to Python, so kind of figuring things out. I have a dataframe from an excel spreadsheet.
Something like this:
MANUFACTURER | MANUFACTURER PART NUMBER 1 | |
---|---|---|
0 | Bourns | 3266W-1-100 |
1 | Bourns | CAT16-680J4LF |
2 | Bourns | CR0603AFX-1000ELF |
3 | Bourns | CR0603-FX-2701ELF |
4 | Bourns | CR0603-FX-30R0ELF |
5 | CTS Corporation | 742C083103JP |
6 | KOA Speer Electronics | RK73H2ATTD1003F |
7 | KOA Speer Electronics | RK73H2ATTD3322F |
My current task is to use sequence match for the values in MANUFACTURER PART NUMBER 1, 3266W-1-100 to CAT16-680J4LF, CAT16-680J4LF to CR0603AFX-1000ELF. Then add a column to the dataframe containing the sequence matcher ratio
MANUFACTURER | MANUFACTURER PART NUMBER 1 | Similarity Score | |
---|---|---|---|
0 | Bourns | 3266W-1-100 | 0.25 |
1 | Bourns | CAT16-680J4LF | 0.26666666666666666 |
2 | Bourns | CR0603AFX-1000ELF | 0.7647058823529411 |
3 | Bourns | CR0603-FX-2701ELF | 0.8235294117647058 |
4 | Bourns | CR0603-FX-30R0ELF | 0.27586206896551724 |
5 | CTS Corporation | 742C083103JP | 0.37037037037037035 |
6 | KOA Speer Electronics | RK73H2ATTD1003F | 0.8 |
7 | KOA Speer Electronics | RK73H2ATTD3322F |
Is that possible? Or should I first make the values as strings first?
My overarching goal is to extract the most occuring series of characters for these codes, and therefore the series family for the codes concerned.
For example: RK73H2ATTD1003F and RK73H2ATTD3322F
Since they are both KOA Speer Electronics, and they just differ in the 'xxxx' characters, the possible series of these is RK73H2ATTDxxxxF
My workflow is
- Import data from excel
- Sequence match the column MANUFACTURER PART NUMBER 1 values with the adjacent value (value of MANUFACTURER PART NUMBER 1 in the row below it)
- As long as the ratio from the sequence match is near to each other, they are possible part of a series family. Add these values as additional column in dataframe.
Final step (another discussion):
- Extract the most occurring string of characters from codes with high similarity scores, and then append it as another column.
Currently using this to manually check scores:
import difflib
string1 = "RK73H2ATTD1003F"
string2 = "742C083103JP"
temp = difflib.SequenceMatcher(None,string1 ,string2)
print('Similarity Score: ',temp.ratio())
Sorry for the long post!
CodePudding user response:
Mentioned code snippet should solve your purpose :). Do optimize the logic if required
INPUT df:
df
idx MANUFACTURER MANUFACTURER PART NUMBER 1
0 0 Bourns 3266W-1-100
1 1 Bourns CAT16-680J4LF
2 2 Bourns CR0603AFX-1000ELF
3 3 Bourns CR0603-FX-2701ELF
4 4 Bourns CR0603-FX-30R0ELF
5 5 CTS Corporation 742C083103JP
6 6 KOA Speer Electronics RK73H2ATTD1003F
7
7 KOA Speer Electronics RK73H2ATTD3322F
rows, cols = df.shape
col_idx = 2 # Col Index for "MANUFACTURER PART NUMBER 1", please change this if you change the column name or position
for row in range(rows-1):
string1 = df.iloc[row, col_idx]
string2 = df.iloc[row 1, col_idx]
if string2 != '':
temp = difflib.SequenceMatcher(None, string1 ,string2)
# print(string1, " :: ", string2, " :: ", temp.ratio())
df.loc[row, 'Similarity Score'] = temp.ratio()
df
OUTPUT:
idx MANUFACTURER MANUFACTURER PART NUMBER 1 Similarity Score
0 0 Bourns 3266W-1-100 0.25
1 1 Bourns CAT16-680J4LF 0.4
2 2 Bourns CR0603AFX-1000ELF 0.764706
3 3 Bourns CR0603-FX-2701ELF 0.823529
4 4 Bourns CR0603-FX-30R0ELF 0.275862
5 5 CTS Corporation 742C083103JP 0.37037
6 6 KOA Speer Electronics RK73H2ATTD1003F 0.8
7 7 KOA Speer Electronics RK73H2ATTD3322F