Optimise code for assigning an identifier based on a string comparison function-CodePudding

Let it be the following dataframe, where we are only interested in the column 'number'. We will only work with the column 'number', the rest can be left out.

| others | color | number |
| ------ | ----- | ------ |
|  one   | green |  12A3B |
| other  |  red  |  23*4C |
|  one   |  red  |  12**B |
|  one   | green |  52ATC |
| other  |  blue |  unkno |
| other  |  blue |  231*C |
|  one   |  red  |  2398T |

I have implemented the following two functions to solve the problem I want to solve.

The first one, returns True if the strings are the same except for the '*' character.

def matching(row, sub_row):
    string = row['number']
    sub_string = sub_row['number']
    flag = True
    i=0
    # In case they coincide in length
    if len(string) == len(sub_string):
        # If a character does not match sequentially, we exit the loop.
        while i < len(string) and flag==True:
            if string[i] != '*' and sub_string[i] != '*':
                if string[i] != sub_string[i]:
                    flag = False
            i =1
    # If they do not match in length 
    else:
        flag = False
    
    return flag

My goal is to get the following dataframe from the original. Making use of the matching function implemented above.

| others | color | number | id_number |
| ------ | ----- | ------ | --------- |
|  one   | green |  12A3B |     0     |
| other  |  red  |  23*4C |     1     |
|  one   |  red  |  12**B |     0     |
|  one   | green |  52ATC |     2     |
| other  |  blue |  unkno |   unkno   |
| other  |  blue |  231*C |     1     |
|  one   |  red  |  2398T |     3     |

For this purpose, I have implemented the following function, which works correctly.

def create_id_number(df):
    df['id_number'] = '-1'
    ident=0
    for i in range(len(df)):
        if df.loc[i, 'id_number'] == '-1':
            if df.loc[i, 'number'] == 'unkno':
                df.loc[i, 'id_number'] = 'unkno'
            else:
               df.loc[i, 'id_number'] = ident
               for j in range(i 1, len(df)):
                   if matching(df.loc[i], df.loc[j]) == True:
                       df.loc[j, 'id_number'] = ident
               ident =1
        
    return df

The implemented functions work. However, when I use a larger dataset (> 50000 rows) it takes a long time to execute. I would like to know if there is a way to improve this code by using some function from an already implemented python library or a more optimal way.

Thanks for your attention.

CodePudding user response：

It feels like there's a clever algorithm for matching the strings, but I can't immediately bring it to mind; let's speed up the rest first, though

One thing is that 95% of the cases are a straight match; those can use a dict lookup, which will be very fast. We only need to do the approximate matching when there's a * in the number

def matching(string1, string2):
    if len(string1) != len(string2):
        return False
    for ch1, ch2 in zip(string1, string2):
        if '*' != ch1 != ch2 != '*':
            return False
    return True

def create_id_number(df):
    ident=0
    ident_map = {
        'unkno': 'unkno',  # special case, map 'unkno' to 'unkno'
    }
    for i in range(len(df)):
        number = df.loc[i, 'number']
        if number in ident_map:
            # already seen (or the special case) - look it up
            df.loc[i, 'id_number'] = ident_map[number]
        elif '*' not in number:
            # not yet seen, no '*' - assign a new ident
            df.loc[i, 'id_number'] = ident_map[number] = ident
            ident  = 1
        else:
            for prev_number, prev_ident in ident_map.values():
               if matching(number, n):
                   df.loc[i, 'id_number'] = i
                   break
            else:
                # Has '*' but doesn't match anything - assign new ident
                # Note: a subsequent non-* number which matches this will
                # be given a different ident
                df.loc[i, 'id_number'] = ident_map[number] = ident
                ident  = 1
    return df

The main caveat here is that a sequence like 12**B then 12Z9B then 12A3B will be assigned three different idents, not two. To fix that, we'd have to make two passes, like this:

def create_id_number(df):
    ident=0
    ident_map = {
        'unkno': 'unkno',  # special case, map 'unkno' to 'unkno'
    }
    # First pass - assign ident to each non-* number
    for i in range(len(df)):
        number = df.loc[i, 'number']
        if '*' not in number not in ident_map:
            # not yet seen, no '*' - assign a new ident
            ident_map[number] = ident
            ident  = 1

    # Second pass - fill in the id_number column
    for i in range(len(df)):
        number = df.loc[i, 'number']
        if number in ident_map:
            # literal number - look it up
            df.loc[i, 'id_number'] = ident_map[number]
        else:
            # number with * - search for it
            for prev_number, prev_ident in ident_map.values():
               if matching(number, n):
                   df.loc[i, 'id_number'] = ident_map[number]
= i
                   break
            else:
                # Has '*' but doesn't match anything - assign new ident

                df.loc[i, 'id_number'] = ident_map[number] = ident
                ident  = 1
    return df

This will assign two idents to a sequence like 12**B then 12Z9B then 12A3B, and one ident to a sequence like 12*9B then 12Z*B.

One note: in Python, we can do three-way comparisons like 1 < x < 3; I use that twice in this code:

'*' != ch1 != ch2 != '*' to mean '*' != ch1 and ch1 != ch2 and ch2 != '*'
'*' not in number not in ident_map to mean '*' not in number and number not in ident_map

I'd have to measure to see if that's actually faster or if it just feels like it should be, but is the same speed...