I want to remove similar rows from a CSV file. Is there a function that can compare strings and discard a row if it is at least an 80% match with another row?
Input data:
Identifier StructureData
0 Entry ID Structure Title
1 5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2 5FK8 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
3 5FK7 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
4 5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5 5FKC Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
6 5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
7 6S82 Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
8 7ALJ Structure of Drosophila Notch EGF domains 11-13
9 6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
10 3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
11 5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
12 5MGC STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
13 5JUV STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
14 5MGD STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
15 5IHR STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE
Expected output:
Identifier StructureData
0 Entry ID Structure Title
1 5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2 7ALJ Structure of Drosophila Notch EGF domains 11-13
3 6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
4 3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5 5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
CodePudding user response:
There are several string similarity measures. Here is one that is built into the standard library:
from difflib import SequenceMatcher
def measure_similarity(x, y):
    return SequenceMatcher(None, x, y).ratio()
s1 = '5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose'
s2 = '5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose'
sim = measure_similarity(s1, s2)
print(sim)
Result:
0.9528301886792453
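To go from a pairwise score to actually discarding rows, one option is to keep a row only if it scores below the threshold against every row kept so far. The following is a minimal sketch under that assumption; `drop_similar` and its `threshold` parameter are names made up for illustration, and the 0.8 default mirrors the 80% from the question (`ratio()` returns a value between 0 and 1).

```python
from difflib import SequenceMatcher


def drop_similar(rows, threshold=0.8):
    """Keep a row only if it is less than `threshold` similar to all kept rows."""
    kept = []
    for row in rows:
        # Compare the candidate against every row that has already been kept.
        if all(SequenceMatcher(None, row, other).ratio() < threshold
               for other in kept):
            kept.append(row)
    return kept


rows = ["hello world", "hello worle", "zzz"]
print(drop_similar(rows))
```

Every row is compared against all kept rows, so the cost grows roughly quadratically with the number of distinct rows. That should be fine for a few thousand lines but may be slow for very large files.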
CodePudding user response:
You are looking for something called fuzzy matching. There are a number of libraries for this, and FuzzyWuzzy is a good one. See this article for a good overview...
Note that the code below does not fully reproduce your expected output: it keeps row 7 (6S82), because its similarity score against each of the rows already kept is below 80. This is most likely because fuzz.ratio is case-sensitive, and row 7 uses title case and has a longer ending than the other rows in its group.
from fuzzywuzzy import fuzz
data = """Identifier StructureData
Entry ID Structure Title
5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
5FK8 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
5FK7 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
5FMD Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5FKC Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
5FIX Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
6S82 Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ Structure of Drosophila Notch EGF domains 11-13
6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
5MGC STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
5JUV STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
5MGD STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
5IHR STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE"""
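result = []
for line_data in data.splitlines():
    # Highest similarity (0-100) between this line and any line already kept
    maxvalue = 0
    for line_result in result:
        maxvalue = max(maxvalue, fuzz.ratio(line_data, line_result))
    # Keep the line only if it is less than 80% similar to everything kept so far
    if maxvalue < 80:
        result.append(line_data)
for line in result:
    print(line)
Output: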
Identifier StructureData
Entry ID Structure Title
5O47 Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
6S82 Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ Structure of Drosophila Notch EGF domains 11-13
6YX5 Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75 Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
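If differences in capitalization or in the entry ID should not count against a match, one possible variation is to compare only the titles, ignoring case. The sketch below uses the standard-library `difflib` rather than FuzzyWuzzy so that it runs without extra installs; `title_key` and `drop_similar_titles` are hypothetical helper names, and it assumes the first whitespace-separated token of each line is the entry ID. Whether this is enough to drop row 7 at a given threshold depends on the data, so check the result on your own file.

```python
from difflib import SequenceMatcher


def title_key(line):
    # Assumption: the first whitespace-separated token is the entry ID.
    parts = line.split(maxsplit=1)
    title = parts[1] if len(parts) > 1 else line
    # casefold() makes the comparison case-insensitive.
    return title.casefold()


def drop_similar_titles(lines, threshold=0.8):
    kept_lines = []  # original lines, returned unchanged
    kept_keys = []   # normalized titles used only for comparison
    for line in lines:
        key = title_key(line)
        if all(SequenceMatcher(None, key, other).ratio() < threshold
               for other in kept_keys):
            kept_lines.append(line)
            kept_keys.append(key)
    return kept_lines


rows = [
    "A1 Structure of X with sucrose",
    "B2 STRUCTURE OF X WITH SUCROSE",
    "C3 Unrelated protein",
]
for row in drop_similar_titles(rows):
    print(row)
```

The original lines are returned untouched; the normalized titles are used only for scoring, so the output keeps the capitalization and IDs of the input.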