I have a dataframe column whose values look like the input below.
Input = '{1:A06YCASDB2LXXXXX000000}{2:A303TYDBTM2AXXD}{3:{108:23158}}{4:\r\n:20:APS0182405\r\n:23B:DRED\r\n:32A:182349USD3280,00\r\n:33B:USD31280,00\r\n:52M:/73240222\r\nRAWR UK Ltd\r\n28 School Road\r\nfast\r\nCo. Angrid\r\n:57A:TETRIS\r\n:59:/BU500023231012000066241\r\nDUMMYNAME DUMMYLASTNAME\r\nPLACE/REST\r\n:70:PA74536/39\r\n:71A:OUR\r\n-}'
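I have developed a chained regex function that applies multiple re.sub operations:
import re

def chainRegex(string):
    # Replace field tags such as :20: or :23B: with a space
    string = re.sub(":\\d{2}[A-Z]?:", " ", string)
    # Replace CRLF line endings with a space
    string = re.sub("\r\n", " ", string)
    # Strip every non-letter character from each whitespace-separated token
    string = [re.sub("([^a-zA-Z ] ?)", "", i) for i in string.split()]
    # Drop tokens that became empty
    string = list(filter(None, string))
    return string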
The expected output is the list below.
output = ['AYCASDBLXXXXXATYDBTMAXXD', 'APS', 'DRED', 'USD', 'USD', 'RAWR', 'UK', 'Ltd', 'School', 'Road', 'fast', 'Co', 'Angrid', 'TETRIS', 'BU', 'DUMMYNAME', 'DUMMYLASTNAME', 'PLACEREST', 'PA', 'OUR']
Is there a way to combine these multiple re.sub operations into one to make it faster, or is there a faster alternative? Parsing the message is not an option because the structure of the string is sometimes corrupted (missing {} or keys).
CodePudding user response:
You can use
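import re

def chainRegex(string):
    # Replace each run of field tags and/or CRLFs with one space, then split on whitespace
    x = re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split()
    # Keep only the letters of each token and drop tokens that end up empty
    return [w for w in ["".join(c for c in i if c.isalpha()) for i in x] if w != ""]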
Here,
re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split() - finds every run of one or more consecutive matches of either a colon, two digits, an optional uppercase letter and a colon, or a CRLF line ending, replaces each run with a single space, and then splits the result on whitespace
["".join(c for c in i if c.isalpha()) for i in x] - removes all non-letters from each word
[w for w in ... if w != ""] - omits the empty items.
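As a quick sanity check, the combined approach can be compared against the three-step version from the question. This is only a sketch: the function names chain_regex_original and chain_regex_combined, the constant FIELD_TAG_OR_CRLF, and the shortened sample string are made up for illustration, and precompiling the pattern is an optional extra rather than part of the answer above.

```python
import re

# Hypothetical name: the combined pattern, compiled once so repeated calls
# (for example over a whole column) reuse the same pattern object.
FIELD_TAG_OR_CRLF = re.compile(r"(?::\d{2}[A-Z]?:|\r\n)+")

def chain_regex_original(string):
    # The three-step version from the question.
    string = re.sub(":\\d{2}[A-Z]?:", " ", string)
    string = re.sub("\r\n", " ", string)
    words = [re.sub("([^a-zA-Z ] ?)", "", i) for i in string.split()]
    return list(filter(None, words))

def chain_regex_combined(string):
    # One substitution, then keep only the letters of each token.
    words = FIELD_TAG_OR_CRLF.sub(" ", string).split()
    cleaned = ("".join(c for c in w if c.isalpha()) for w in words)
    return [w for w in cleaned if w]

# Shortened, made-up sample in the same shape as the input in the question.
sample = (
    "{1:A06YC}{4:\r\n:20:APS0182405\r\n:23B:DRED\r\n:32A:182349USD3280,00"
    "\r\n:52M:/73240222\r\nRAWR UK Ltd\r\n28 School Road\r\n-}"
)

result = chain_regex_combined(sample)
print(result)

# Both versions should agree on ASCII-only input like this sample.
assert result == chain_regex_original(sample)
```

One difference to keep in mind: str.isalpha() also accepts non-ASCII letters (for example "é"), whereas [a-zA-Z] does not, so the two versions can diverge on input that contains such characters. Whether the combined version is actually faster on your data is best measured with timeit on a representative sample.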