How to find duplicate strings (communication addresses) in a Python list?

I have a task: remove duplicate addresses from a list.

Case 1: a list of 5 addresses, of which only 2 are required; the other 3 are duplicates.

['3805 Swan House Ct||Burtonsville|MD|20866',
 '3805 Swan House Ct||Burtonsville|Md|20866',
 '6113 Loventree Rd||Columbia|MD|21044',
 '6113 Loventree Rd||Columbia|Md|21044',
 '6113 Loventree Road||Columbia|MD|21044']

Here the addresses '3805 Swan House Ct||Burtonsville|MD|20866' and '3805 Swan House Ct||Burtonsville|Md|20866' are similar, so only one of them should be returned, chosen by length. They have the same length, so '3805 Swan House Ct||Burtonsville|MD|20866' is fine.

For the '6113 Loventree' address variants there are 3 addresses; after comparing them, the result should be the longest one, '6113 Loventree Road||Columbia|MD|21044'.

Expected output:

['3805 Swan House Ct||Burtonsville|MD|20866','6113 Loventree Road||Columbia|MD|21044']

Case 2: a list of 3 addresses, of which only one should be extracted.

['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216', '4512fairfaxrd|Apt2|Baltimore|Md|21216', '4512 Fairfax Rd|Apt 2|Baltimore|Md|21216']

Expected output (the longest address is kept):

['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216']

CodePudding user response:

You can use difflib, but I'm not sure exactly how it decides which strings count as close matches.
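To make the matching less of a black box, here is a minimal sketch of what difflib does with two of the addresses from the question. The variable names and the 'unrelated text' entry are illustrative only. `SequenceMatcher.ratio()` returns a similarity score between 0 and 1, and `get_close_matches` keeps at most `n` candidates (default 3) whose score is at least `cutoff` (default 0.6), best match first.

```python
import difflib

short_form = '6113 Loventree Rd||Columbia|MD|21044'
long_form = '6113 Loventree Road||Columbia|MD|21044'

# ratio() is 2 * M / T, where M is the number of matching characters
# and T is the total length of both strings.
score = difflib.SequenceMatcher(None, short_form, long_form).ratio()
print(round(score, 3))

# The defaults n=3 and cutoff=0.6 are written out here for clarity.
candidates = [short_form, long_form, 'unrelated text']
matches = difflib.get_close_matches(short_form, candidates, n=3, cutoff=0.6)
print(matches)
```

Raising `cutoff` makes the matching stricter, and raising `n` lets larger groups of variants be found together.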

from collections import OrderedDict
import difflib

data = ['3805 Swan House Ct||Burtonsville|MD|20866',
    '3805 Swan House Ct||Burtonsville|Md|20866',
    '6113 Loventree Rd||Columbia|MD|21044',
    '6113 Loventree Rd||Columbia|Md|21044',
    '6113 Loventree Road||Columbia|MD|21044',
    "123 Cherry Lane Apt 12",
    "123 Cherry Lane Apt 121"]

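test = []
for word in data:
    # Up to 3 entries of data that are similar to word (difflib defaults: n=3, cutoff=0.6)
    new_list = difflib.get_close_matches(word, data)
    # Keep the first entry of data that contains any close match as a substring
    match_data = [i for i in data if any((j in i) for j in new_list)][:1]
    test.append(match_data[0])
# Drop repeated representatives while preserving their order
remove_dup = list(OrderedDict.fromkeys(test))
print(remove_dup)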

>>> ['3805 Swan House Ct||Burtonsville|MD|20866', '6113 Loventree Rd||Columbia|MD|21044', '123 Cherry Lane Apt 12']

If you want the address chosen by its length:

test = []
for word in data:
    new_list = difflib.get_close_matches(word, data)
    match_data = [i for i in data if any((j in i) for j in new_list)]
    # Keep the longest candidate; on a tie, max() returns the first one
    test.append(max(match_data, key=len))

remove_dup = list(OrderedDict.fromkeys(test))
print(remove_dup)

>>> ['3805 Swan House Ct||Burtonsville|MD|20866', '6113 Loventree Road||Columbia|MD|21044', '123 Cherry Lane Apt 121']
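The difflib approach above treats "Apt 12" and "Apt 121" as the same address, and it is not clear that it handles Case 2 from the question. As an alternative, here is a rough sketch that groups addresses by a normalized key and keeps the longest variant of each group. This is a different technique from the difflib one: `ABBREVIATIONS`, `normalize` and `dedupe_addresses` are hypothetical names, and the abbreviation table is an assumption that only covers the sample data.

```python
import re

# Hypothetical abbreviation table (an assumption); extend it for real data.
ABBREVIATIONS = {'road': 'rd', 'apartment': 'apt', 'court': 'ct'}

def normalize(address):
    """Build a comparison key: lowercase, abbreviate, drop non-alphanumerics."""
    key = address.lower()
    # Plain substring replacement, so it also rewrites words that merely
    # contain these terms (e.g. 'broadway'); acceptable for a rough key.
    for long_form, short_form in ABBREVIATIONS.items():
        key = key.replace(long_form, short_form)
    return re.sub(r'[^a-z0-9]', '', key)

def dedupe_addresses(addresses):
    """Group addresses by normalized key and keep the longest of each group."""
    groups = {}
    for address in addresses:
        groups.setdefault(normalize(address), []).append(address)
    # dicts keep insertion order (Python 3.7+); max() returns the first
    # of several equally long variants.
    return [max(variants, key=len) for variants in groups.values()]

case2 = ['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216',
         '4512fairfaxrd|Apt2|Baltimore|Md|21216',
         '4512 Fairfax Rd|Apt 2|Baltimore|Md|21216']
print(dedupe_addresses(case2))
```

Because the comparison is an exact match on the key, '123 Cherry Lane Apt 12' and '123 Cherry Lane Apt 121' stay separate. The trade-off is that misspellings are not caught, so combining the key with a difflib score may still be worthwhile.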