Let it be the following dataframe, where we are only interested in the column 'number'. We will only work with the column 'number', the rest can be left out.
| others | color | number |
| ------ | ----- | ------ |
| one | green | 12A3B |
| other | red | 23*4C |
| one | red | 12**B |
| one | green | 52ATC |
| other | blue | unkno |
| other | blue | 231*C |
| one | red | 2398T |
I have implemented the following two functions to solve the problem I want to solve.
The first one, returns True if the strings are the same except for the '*' character.
def matching(row, sub_row):
string = row['number']
sub_string = sub_row['number']
flag = True
i=0
# In case they coincide in length
if len(string) == len(sub_string):
# If a character does not match sequentially, we exit the loop.
while i < len(string) and flag==True:
if string[i] != '*' and sub_string[i] != '*':
if string[i] != sub_string[i]:
flag = False
i =1
# If they do not match in length
else:
flag = False
return flag
My goal is to get the following dataframe from the original. Making use of the matching function implemented above.
| others | color | number | id_number |
| ------ | ----- | ------ | --------- |
| one | green | 12A3B | 0 |
| other | red | 23*4C | 1 |
| one | red | 12**B | 0 |
| one | green | 52ATC | 2 |
| other | blue | unkno | unkno |
| other | blue | 231*C | 1 |
| one | red | 2398T | 3 |
For this purpose, I have implemented the following function, which works correctly.
def create_id_number(df):
df['id_number'] = '-1'
ident=0
for i in range(len(df)):
if df.loc[i, 'id_number'] == '-1':
if df.loc[i, 'number'] == 'unkno':
df.loc[i, 'id_number'] = 'unkno'
else:
df.loc[i, 'id_number'] = ident
for j in range(i 1, len(df)):
if matching(df.loc[i], df.loc[j]) == True:
df.loc[j, 'id_number'] = ident
ident =1
return df
The implemented functions work. However, when I use a larger dataset (> 50000 rows) it takes a long time to execute. I would like to know if there is a way to improve this code by using some function from an already implemented python library or a more optimal way.
Thanks for your attention.
CodePudding user response:
It feels like there's a clever algorithm for matching the strings, but I can't immediately bring it to mind; let's speed up the rest first, though
One thing is that 95% of the cases are a straight match; those can use a dict
lookup, which will be very fast. We only need to do the approximate matching when there's a *
in the number
def matching(string1, string2):
if len(string1) != len(string2):
return False
for ch1, ch2 in zip(string1, string2):
if '*' != ch1 != ch2 != '*':
return False
return True
def create_id_number(df):
ident=0
ident_map = {
'unkno': 'unkno', # special case, map 'unkno' to 'unkno'
}
for i in range(len(df)):
number = df.loc[i, 'number']
if number in ident_map:
# already seen (or the special case) - look it up
df.loc[i, 'id_number'] = ident_map[number]
elif '*' not in number:
# not yet seen, no '*' - assign a new ident
df.loc[i, 'id_number'] = ident_map[number] = ident
ident = 1
else:
for prev_number, prev_ident in ident_map.values():
if matching(number, n):
df.loc[i, 'id_number'] = i
break
else:
# Has '*' but doesn't match anything - assign new ident
# Note: a subsequent non-* number which matches this will
# be given a different ident
df.loc[i, 'id_number'] = ident_map[number] = ident
ident = 1
return df
The main caveat here is that a sequence like 12**B
then 12Z9B
then 12A3B
will be assigned three different ident
s, not two. To fix that, we'd have to make two passes, like this:
def create_id_number(df):
ident=0
ident_map = {
'unkno': 'unkno', # special case, map 'unkno' to 'unkno'
}
# First pass - assign ident to each non-* number
for i in range(len(df)):
number = df.loc[i, 'number']
if '*' not in number not in ident_map:
# not yet seen, no '*' - assign a new ident
ident_map[number] = ident
ident = 1
# Second pass - fill in the id_number column
for i in range(len(df)):
number = df.loc[i, 'number']
if number in ident_map:
# literal number - look it up
df.loc[i, 'id_number'] = ident_map[number]
else:
# number with * - search for it
for prev_number, prev_ident in ident_map.values():
if matching(number, n):
df.loc[i, 'id_number'] = ident_map[number]
= i
break
else:
# Has '*' but doesn't match anything - assign new ident
df.loc[i, 'id_number'] = ident_map[number] = ident
ident = 1
return df
This will assign two idents to a sequence like 12**B
then 12Z9B
then 12A3B
, and one ident to a sequence like 12*9B
then 12Z*B
.
One note: in Python, we can do three-way comparisons like 1 < x < 3; I use that twice in this code:
'*' != ch1 != ch2 != '*'
to mean'*' != ch1 and ch1 != ch2 and ch2 != '*'
'*' not in number not in ident_map
to mean'*' not in number and number not in ident_map
I'd have to measure to see if that's actually faster or if it just feels like it should be, but is the same speed...