Removing similar rows from a csv file

Time:08-12

I want to remove similar rows from a CSV file. Is there a function that can compare strings and discard a row if it is at least an 80% match with another row?

input data

    Identifier  StructureData
0   Entry ID    Structure Title
1   5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2   5FK8        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
3   5FK7        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
4   5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5   5FKC        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
6   5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
7   6S82        Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
8   7ALJ        Structure of Drosophila Notch EGF domains 11-13
9   6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
10  3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
11  5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
12  5MGC        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
13  5JUV        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
14  5MGD        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
15  5IHR        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE

expected output data

    Identifier  StructureData
0   Entry ID    Structure Title
1   5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
2   7ALJ        Structure of Drosophila Notch EGF domains 11-13
3   6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
4   3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5   5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose

CodePudding user response:

There are several string similarity measures. Here is one that is built into the standard library (difflib):

from difflib import SequenceMatcher

def measure_similarity(x, y):
    return SequenceMatcher(None, x, y).ratio()

s1 = '5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose'
s2 = '5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose'
sim = measure_similarity(s1, s2)

print(sim)

Result:

0.9528301886792453
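
To go from a single pairwise score to the filtering asked about, one option is a greedy pass that keeps a row only when it scores below the threshold against every row kept so far. This is a minimal sketch, not a definitive implementation: the function name `drop_similar` and its `threshold` parameter are made up for illustration, and the 0.8 cut-off is taken from the question.

```python
from difflib import SequenceMatcher


def drop_similar(lines, threshold=0.8):
    """Keep a line only if it is less than `threshold` similar to every kept line.

    `drop_similar` and `threshold` are hypothetical names for this sketch.
    """
    kept = []
    for line in lines:
        # all() is True for an empty `kept`, so the first line is always kept.
        if all(SequenceMatcher(None, line, k).ratio() < threshold for k in kept):
            kept.append(line)
    return kept


rows = [
    "5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose",
    "5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose",
    "7ALJ        Structure of Drosophila Notch EGF domains 11-13",
]

# The 5FIX row is dropped because it is too close to the 5FMD row.
for row in drop_similar(rows):
    print(row)
```

Because the pass is greedy, the first row of each group of similar rows is the one that survives, and every new row is compared against all rows kept so far.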

CodePudding user response:

You are looking for something called fuzzy matching. There are a number of libraries for this; FuzzyWuzzy (published as thefuzz in its newer releases) is a good one. See this article for a good overview...

Note that the code below does not fully reproduce your expected output: it keeps row 7 (6S82), because its similarity score against each previously kept row is below 80.

from fuzzywuzzy import fuzz


data = """Identifier  StructureData
Entry ID    Structure Title
5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
5FK8        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Neo-erlose
5FK7        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with neokestose
5FMD        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with nystose
5FKC        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with Raffinose
5FIX        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose
6S82        Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ        Structure of Drosophila Notch EGF domains 11-13
6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
5MGC        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 4-Galactosyl-lactose
5JUV        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-b-Galactopyranosyl galactose
5MGD        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 6-Galactosyl-lactose
5IHR        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH ALLOLACTOSE"""

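result = []  # rows kept so far
for line_data in data.splitlines():
    # highest similarity score (0-100) against any row already kept
    maxvalue = 0
    for line_result in result:
        maxvalue = max(maxvalue, fuzz.ratio(line_data, line_result))
    # keep the row only if it is less than 80% similar to every kept row
    if maxvalue < 80:
        result.append(line_data)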


for line in result:
    print(line)

Output:

Identifier  StructureData
Entry ID    Structure Title
5O47        Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with fructosyl-hydroxytyrosol
6S82        Structure Of D80A-Fructofuranosidase From Xanthophyllomyces Dendrorhous Complexed With Tris-buffer molecule And hydroquinone
7ALJ        Structure of Drosophila Notch EGF domains 11-13
6YX5        Structure of DrrA from Legionella pneumophilia in complex with human Rab8a
3U75        Structure of E230A-fructofuranosidase from Schwanniomyces occidentalis complexed with fructosylnystose
5IFT        STRUCTURE OF E298Q-BETA-GALACTOSIDASE FROM ASPERGILLUS NIGER IN COMPLEX WITH 3-b-Galactopyranosyl glucose
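
Row 6S82 differs from the other D80A rows partly in capitalisation ("Of", "From", "Complexed With"), and the identifier column also takes part in the comparison above. A possible refinement, sketched here with the standard-library `difflib` rather than FuzzyWuzzy, is to compare only the title column and to ignore case. The helper name `dedupe_rows`, its parameters, the comma-separated layout, and the `DEMO` row are assumptions made for illustration. Whether 6S82 then lands above the threshold on the real data has not been checked here.

```python
import csv
import io
from difflib import SequenceMatcher


def dedupe_rows(rows, column=1, threshold=0.8):
    """Greedy filter on one column, ignoring case (hypothetical helper).

    A row is kept unless its `column` text is at least `threshold` similar
    to the same column of a row that was already kept.
    """
    kept = []
    for row in rows:
        text = row[column].casefold()
        if all(
            SequenceMatcher(None, text, other[column].casefold()).ratio() < threshold
            for other in kept
        ):
            kept.append(row)
    return kept


# Miniature stand-in for the real file; the DEMO row is invented for this
# sketch and is simply the 5FIX title in upper case.
sample = io.StringIO(
    "5FIX,Structure of D80A-fructofuranosidase from Xanthophyllomyces dendrorhous complexed with sucrose\n"
    "DEMO,STRUCTURE OF D80A-FRUCTOFURANOSIDASE FROM XANTHOPHYLLOMYCES DENDRORHOUS COMPLEXED WITH SUCROSE\n"
    "7ALJ,Structure of Drosophila Notch EGF domains 11-13\n"
)

unique_rows = dedupe_rows(list(csv.reader(sample)))
for identifier, title in unique_rows:
    print(identifier, title)
```

With case folded, the `DEMO` title is identical to the 5FIX title and is dropped, while the 7ALJ row is kept. The same function can be pointed at a real file by passing `csv.reader(open(path, newline=""))` rows instead of the in-memory sample.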