I am reading in data from multiple Excel files and writing them back to an aggregated Excel file.
This is the output I have; it represents the relations of multiple entities within my company (entity-ID) with other companies (debitor-name):
debitor_list = [
    ("1", "X AG"),
    ("1", "X AG"),
    ("1", "Z AG"),
    ("2", "X AG"),
    ("2", "X AG"),
    ("3", "LOL AG"),
    ("1", "Z AG"),
    ("1", "HS AG"),
    ("2", "hs ag")
]
The tuples structure within this list is the following:
('entity-ID', 'debitor-name')
In addition, I have a list which represents the real names and information about debitors:
real_file = ["LOLLIPOP AG", "HS AG", "X AG", "Z AG"]
Then I check for similarities between each debitor name in debitor_list and the names in real_file, to replace it with the real name:
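import difflib as dif

for deb in debitor_list:
    for cam in real_file:
        # Skip names that already match exactly
        if deb[1] != cam:
            sequence = dif.SequenceMatcher(
                isjunk=None,
                a=deb[1].lower(),
                b=cam.lower()
            )
            # Similarity as a percentage (0-100)
            match = sequence.ratio() * 100
            if match >= 80:
                print(deb[1], cam, match)
                # Appends the corrected tuple; the misspelled one stays in the list
                debitor_list.append((deb[0], cam))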
Output:
hs ag HS AG 100.0
How can I delete the ("2", "hs ag") tuple?
CodePudding user response:
Either you build a new list, or you replace the element in place with some simple logic; see the two options below.
Note that tuples are immutable, but the list itself is not, so a whole tuple can be replaced at its index.
import difflib as dif
debitor_list = [
    ("1", "X AG"),
    ("1", "X AG"),
    ("1", "Z AG"),
    ("2", "X AG"),
    ("2", "X AG"),
    ("3", "LOL AG"),
    ("1", "Z AG"),
    ("1", "HS AG"),
    ("2", "hs ag"),
]
real_file = ["LOLLIPOP AG", "HS AG", "X AG", "Z AG"]

# Option 1: build and return a new list, leaving the input untouched
def fix_stuff(d_list, c_list):
    result = []
    for deb in d_list:
        repl_val = None  # reset for every tuple
        for cam in c_list:
            if deb[1] != cam:
                sequence = dif.SequenceMatcher(
                    isjunk=None, a=deb[1].lower(), b=cam.lower()
                )
                match = sequence.ratio() * 100
                if match >= 80:
                    repl_val = cam
        if repl_val:
            # A close match was found: store the real name instead
            result.append((deb[0], repl_val))
        else:
            result.append(deb)
    return result

print(debitor_list)
new_deb_list = fix_stuff(debitor_list, real_file)
print(new_deb_list)

# Option 2: replace the tuple in place via its index
for idx, deb in enumerate(debitor_list):
    for cam in real_file:
        if deb[1] != cam:
            sequence = dif.SequenceMatcher(isjunk=None, a=deb[1].lower(), b=cam.lower())
            match = sequence.ratio() * 100
            if match >= 80:
                debitor_list[idx] = (deb[0], cam)
print(debitor_list)
Output:
[('1', 'X AG'), ('1', 'X AG'), ('1', 'Z AG'), ('2', 'X AG'), ('2', 'X AG'), ('3', 'LOL AG'), ('1', 'Z AG'), ('1', 'HS AG'), ('2', 'hs ag')]
[('1', 'X AG'), ('1', 'X AG'), ('1', 'Z AG'), ('2', 'X AG'), ('2', 'X AG'), ('3', 'LOL AG'), ('1', 'Z AG'), ('1', 'HS AG'), ('2', 'HS AG')]
[('1', 'X AG'), ('1', 'X AG'), ('1', 'Z AG'), ('2', 'X AG'), ('2', 'X AG'), ('3', 'LOL AG'), ('1', 'Z AG'), ('1', 'HS AG'), ('2', 'HS AG')]
The if repl_val check tests whether the value needs to be replaced. Since the variable repl_val is reset to None at the start of each iteration of the outer for loop, if repl_val is only true when a match was found during the inner loop.
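The None sentinel can be looked at in isolation with a small sketch. Here find_replacement is a hypothetical helper (it is not part of the code above) that wraps the inner loop and returns None when no name reaches the threshold. Testing with is not None is a slightly stricter variant of if repl_val, since an empty string would also count as false.

```python
import difflib


def find_replacement(name, real_names, threshold=80):
    # Hypothetical helper: returns the last real name whose similarity to
    # `name` reaches `threshold` (in percent), or None if nothing qualifies.
    repl_val = None  # sentinel: "nothing found yet"
    for cam in real_names:
        if name != cam:
            ratio = difflib.SequenceMatcher(None, name.lower(), cam.lower()).ratio()
            if ratio * 100 >= threshold:
                repl_val = cam
    return repl_val


real_file = ["LOLLIPOP AG", "HS AG", "X AG", "Z AG"]

for name in ("hs ag", "X AG", "LOL AG"):
    replacement = find_replacement(name, real_file)
    if replacement is not None:
        print(name, "->", replacement)
    else:
        print(name, "stays as it is")
```

With this data only "hs ag" gets a replacement; "X AG" is already an exact name and "LOL AG" stays below the 80 percent threshold against every entry.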
As for result: the function does not modify the incoming lists; it builds a new list, result, and returns it.
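The "return a new list" idea can also be sketched with difflib.get_close_matches from the standard library. This is a swapped-in technique, not the code from the answer: it picks the single best match at or above the cutoff rather than the last one found. The function name normalise and the lookup dict are assumptions made for illustration, and the sketch assumes no two real names differ only by case.

```python
import difflib


def normalise(d_list, c_list, cutoff=0.8):
    # Hypothetical helper: map lowercased real names back to their spelling.
    lookup = {name.lower(): name for name in c_list}
    result = []
    for entity_id, name in d_list:
        # Best candidate with a ratio of at least `cutoff`, or an empty list.
        hits = difflib.get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
        result.append((entity_id, lookup[hits[0]] if hits else name))
    return result


debitor_list = [("1", "X AG"), ("3", "LOL AG"), ("1", "HS AG"), ("2", "hs ag")]
real_file = ["LOLLIPOP AG", "HS AG", "X AG", "Z AG"]

new_list = normalise(debitor_list, real_file)
print(new_list)
print(debitor_list)  # the input list is left as it was
```

Because the input is never assigned to, the caller can keep the original list around for comparison or logging.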
As for the second way to do this (and that is likely the better way): thanks to enumerate we get an index (idx) for each list element as well as the value deb. This allows assigning directly to the original list by its index, so the original list is modified in place.
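Why the index is needed can be shown with a tiny sketch. The list pairs and the use of upper() are made up for illustration only; the point is the difference between rebinding the loop variable, editing a tuple, and assigning to a list slot.

```python
pairs = [("1", "HS AG"), ("2", "hs ag")]

# Rebinding the loop variable only changes the local name, not the list.
for deb in pairs:
    deb = (deb[0], deb[1].upper())
print(pairs)  # still contains ("2", "hs ag")

# A tuple cannot be edited in place.
try:
    pairs[1][1] = "HS AG"
except TypeError as exc:
    print("tuples are immutable:", exc)

# Replacing the whole tuple at its index does change the list.
for idx, deb in enumerate(pairs):
    pairs[idx] = (deb[0], deb[1].upper())
print(pairs)
```

Only the last loop changes the contents of pairs, which is the same mechanism the enumerate version above relies on.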