Is there a way to combine multiple re.sub operations into one to make it faster in Python?

I have a dataframe column that has an input like below.

Input = '{1:A06YCASDB2LXXXXX000000}{2:A303TYDBTM2AXXD}{3:{108:23158}}{4:\r\n:20:APS0182405\r\n:23B:DRED\r\n:32A:182349USD3280,00\r\n:33B:USD31280,00\r\n:52M:/73240222\r\nRAWR UK Ltd\r\n28 School Road\r\nfast\r\nCo. Angrid\r\n:57A:TETRIS\r\n:59:/BU500023231012000066241\r\nDUMMYNAME DUMMYLASTNAME\r\PLACE/REST\r\n:70:PA74536/39\r\n:71A:OUR\r\n-}'

I have written a function that chains multiple re.sub operations:

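    import re

    def chainRegex(string):
        # Replace field tags such as :20: or :32A: with a space
        string = re.sub(r":\d{2}[A-Z]?:", " ", string)
        # Replace CRLF line endings with a space
        string = re.sub("\r\n", " ", string)
        # Strip every non-letter character from each whitespace-separated token
        string = [re.sub(r"([^a-zA-Z ] ?)", "", i) for i in string.split()]
        # Drop tokens that became empty
        string = list(filter(None, string))
        return string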

The expected output is the list below.

output = ['AYCASDBLXXXXXATYDBTMAXXD', 'APS', 'DRED', 'USD', 'USD', 'RAWR', 'UK', 'Ltd', 'School', 'Road', 'fast', 'Co', 'Angrid', 'TETRIS', 'BU', 'DUMMYNAME', 'DUMMYLASTNAME', 'PLACEREST', 'PA', 'OUR']

Is there a way to combine these multiple re.sub operations into one to make it faster, or is there a faster alternative? Parsing the message is not an option because the structure of the string is sometimes corrupted (missing {} or keys).

CodePudding user response：

You can use

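    import re
    def chainRegex(string):
        # Collapse each run of field tags (:20:, :32A:, ...) and CRLFs into one space, then split
        x = re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split()
        # Keep only the letters of each token and drop tokens that end up empty
        return [w for w in ["".join(c for c in i if c.isalpha()) for i in x] if w != ""]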


Here,

re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split() finds every run of one or more field tags (a colon, two digits, an optional uppercase letter and a colon) or CRLF line endings, replaces each run with a single space, and splits the result on whitespace
["".join(c for c in i if c.isalpha()) for i in x] removes all non-letter characters from each word
[w for w in ... if w != ""] omits the empty items.
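If the function is called once per dataframe row, another option is to compile the patterns once and strip the non-letters from the whole string before splitting, instead of token by token. The following is only a sketch under some assumptions: the names chain_regex_compiled, TAG_OR_CRLF and NON_LETTER are made up for illustration, the sample is a shortened version of the input from the question, and only ASCII letters are kept (as in the question's original pattern), whereas str.isalpha() also keeps non-ASCII letters. Whether it is actually faster depends on the data, so the timeit call at the end is there to measure it rather than assume it.

```python
import re
import timeit

# Compiled once at import time and reused on every call.
TAG_OR_CRLF = re.compile(r":\d{2}[A-Z]?:|\r\n")  # field tags such as :20: or :32A:, and CRLF
NON_LETTER = re.compile(r"[^A-Za-z\s]+")         # neither an ASCII letter nor whitespace


def chain_regex_compiled(text):
    """Hypothetical variant: two substitutions on the whole string, then one split."""
    spaced = TAG_OR_CRLF.sub(" ", text)
    letters_only = NON_LETTER.sub("", spaced)
    # split() without arguments collapses whitespace runs, so no empty strings remain
    return letters_only.split()


# Shortened version of the sample input from the question
sample = (
    "{1:A06YCASDB2LXXXXX000000}{4:\r\n:20:APS0182405\r\n:23B:DRED\r\n"
    ":32A:182349USD3280,00\r\n:52M:/73240222\r\nRAWR UK Ltd\r\n"
    "28 School Road\r\n:71A:OUR\r\n-}"
)

print(chain_regex_compiled(sample))
# Rough timing for 10,000 calls; compare against the other versions on real data
print(timeit.timeit(lambda: chain_regex_compiled(sample), number=10_000))
```

On this shortened sample the function returns ['AYCASDBLXXXXX', 'APS', 'DRED', 'USD', 'RAWR', 'UK', 'Ltd', 'School', 'Road', 'OUR']. Tokens made up only of digits or punctuation disappear because the surrounding whitespace is collapsed by split().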