I need to remove duplicate addresses from a list.
Case 1: a list of 5 addresses, of which only 2 are required; the other 3 are duplicates.
['3805 Swan House Ct||Burtonsville|MD|20866',
'3805 Swan House Ct||Burtonsville|Md|20866',
'6113 Loventree Rd||Columbia|MD|21044',
'6113 Loventree Rd||Columbia|Md|21044',
'6113 Loventree Road||Columbia|MD|21044']
Here the addresses '3805 Swan House Ct||Burtonsville|MD|20866' and '3805 Swan House Ct||Burtonsville|Md|20866' differ only in case, so the result should keep one of them, chosen by length. Since both have the same length, '3805 Swan House Ct||Burtonsville|MD|20866' is fine.
For the '6113 Loventree' address there are 3 variants; after comparing them, the result should be the longest one, '6113 Loventree Road||Columbia|MD|21044'.
Expected output:
['3805 Swan House Ct||Burtonsville|MD|20866','6113 Loventree Road||Columbia|MD|21044']
Case 2: a list of 3 addresses, from which only one address should be extracted.
['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216', '4512fairfaxrd|Apt2|Baltimore|Md|21216', '4512 Fairfax Rd|Apt 2|Baltimore|Md|21216']
Expected output, keeping the longest address:
['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216']
CodePudding user response:
You can use difflib, though I'm not sure exactly how it scores near matches in your data.
from collections import OrderedDict
import difflib

data = ['3805 Swan House Ct||Burtonsville|MD|20866',
        '3805 Swan House Ct||Burtonsville|Md|20866',
        '6113 Loventree Rd||Columbia|MD|21044',
        '6113 Loventree Rd||Columbia|Md|21044',
        '6113 Loventree Road||Columbia|MD|21044',
        "123 Cherry Lane Apt 12",
        "123 Cherry Lane Apt 121"]

test = []
for word in data:
    # Up to 3 entries of data that resemble word (difflib's default n and cutoff).
    new_list = difflib.get_close_matches(word, data)
    # Keep the first entry of data that contains any of those matches as a substring.
    match_data = [i for i in data if any((j in i) for j in new_list)][:1]
    test.append(match_data[0])

# Drop repeated representatives while preserving their order.
remove_dup = list(OrderedDict.fromkeys(test))
print(remove_dup)
>>> ['3805 Swan House Ct||Burtonsville|MD|20866', '6113 Loventree Rd||Columbia|MD|21044', '123 Cherry Lane Apt 12']
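On how difflib scores close matches: as far as I understand it, `get_close_matches` compares strings with `difflib.SequenceMatcher(...).ratio()`, keeps candidates scoring at least `cutoff` (default 0.6), and returns at most `n` of them (default 3), best first. Here is a small sketch to inspect those scores yourself. The helper name `similarity` is mine, not part of difflib.

```python
import difflib

a = '6113 Loventree Rd||Columbia|MD|21044'
b = '6113 Loventree Road||Columbia|MD|21044'
c = '3805 Swan House Ct||Burtonsville|MD|20866'

def similarity(x, y):
    # Hypothetical helper: ratio() is a float in [0, 1], 1.0 meaning identical.
    return difflib.SequenceMatcher(None, x, y).ratio()

print(similarity(a, b))  # near-duplicate, scores high
print(similarity(a, c))  # different address, scores low

# Same call as in the answer, with the default parameters spelled out.
print(difflib.get_close_matches(a, [a, b, c], n=3, cutoff=0.6))
```

Raising `cutoff` makes the matching stricter, which may help when two genuinely different addresses look alike, such as 'Apt 12' and 'Apt 121'.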
If you want to keep the longest address from each group of matches:
# Reuses data, difflib and OrderedDict from the snippet above.
test = []
for word in data:
    new_list = difflib.get_close_matches(word, data)
    match_data = [i for i in data if any((j in i) for j in new_list)]
    # Keep the longest address in the group (the first one wins on ties).
    test.append(max(match_data, key=len))

remove_dup = list(OrderedDict.fromkeys(test))
print(remove_dup)
>>> ['3805 Swan House Ct||Burtonsville|MD|20866', '6113 Loventree Road||Columbia|MD|21044', '123 Cherry Lane Apt 121']
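For Case 2, where one variant has no spaces at all ('4512fairfaxrd|Apt2|...'), an alternative to fuzzy matching is to normalize each address into a comparison key and keep the longest original per key. This is only a sketch under assumptions: the `ABBREVIATIONS` table and the function names `normalize` and `dedupe_keep_longest` are made up for illustration, and the table would need extending for real data.

```python
import re

# Hypothetical abbreviation table (assumption): long form -> short form.
ABBREVIATIONS = {"road": "rd", "street": "st", "court": "ct", "apartment": "apt"}

def normalize(address):
    """Build a comparison key: lowercase, abbreviate, drop non-alphanumerics."""
    text = address.lower()
    for long_form, short_form in ABBREVIATIONS.items():
        # Whole words only, so 'road' inside 'broadway' is left alone.
        text = re.sub(rf"\b{long_form}\b", short_form, text)
    return re.sub(r"[^a-z0-9]", "", text)

def dedupe_keep_longest(addresses):
    """Keep the longest address per key; the first one wins on ties."""
    best = {}
    for address in addresses:
        key = normalize(address)
        if key not in best or len(address) > len(best[key]):
            best[key] = address
    # Dicts preserve insertion order, so groups come out in first-seen order.
    return list(best.values())

case2 = ['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216',
         '4512fairfaxrd|Apt2|Baltimore|Md|21216',
         '4512 Fairfax Rd|Apt 2|Baltimore|Md|21216']
print(dedupe_keep_longest(case2))
# ['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216']
```

Because the keys must match exactly, '123 Cherry Lane Apt 12' and '123 Cherry Lane Apt 121' stay separate here, but typos or abbreviations missing from the table will not be merged.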