I am trying to extract the SQLCODE which typically is 4 integers from error text like below :
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id 629709103
3. SQLSTATE: 40001, SQLCODE: 2631 Session Id 594700603
4. SQLSTATE: T7547, SQLCODE: 754Session Id613234380
I have below pattern matching regex for the same currently. However, there is aedge case like 4 which is failing.
error_cd = re.findall(r'SQLCODE:\s([^.,\s] )', err_log)
If there are no 4 integers after SQLCODE, i would like to extract the SQLSTATE text after the letter 'T'
Expected Output :
1. 3535
2. 2631
3. 2631
4. 7547
Any suggestions on how to achieve this is appreciated. Thank you.
CodePudding user response:
This could probably be done purely in regex, using a lookahead/lookbehind to handle the conditional extraction, but that could get really messy.
SOLUTION 1: Pure Regex:
EDIT: Here's the pure regex solution... simpler than I thought it would be (though definitely more error-prone than the mixed approach... would need some additional logic to make it more robust):
re.findall(r'((?:(?<=SQLSTATE: T)(?![0-9]{4}, SQLCODE: [0-9]{4})[0-9]{4})|(?:(?<=SQLCODE: )[0-9]{4}))', err_log)
SOLUTION 2: Regex and Python:
The following solution uses regex to pull both the SQLSTATE and SQLCODE values, and uses a list comprehension to do the conditional extraction:
err_log = '''
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id 629709103
3. SQLSTATE: 40001, SQLCODE: 2631 Session Id 594700603
4. SQLSTATE: T7547, SQLCODE: 754Session Id613234380
'''
error_st_cd = re.findall(r'SQLSTATE: T([0-9] ), SQLCODE: ([0-9]{4})?', err_log)
error_cd = [codes[1] or codes[0] for codes in error_st_cd]
for i, cd in enumerate(error_cd):
print(f'{i 1}. {cd}')
Output:
- 3535
- 2631
- 2631
- 7547
CodePudding user response:
A regex approach might indeed be easier, but in any case, here's an approach w/o use of regex:
test_string = """
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id 629709103
3. SQLSTATE: 40001, SQLCODE: 2631 Session Id 594700603
4. SQLSTATE: T7547, SQLCODE: 754Session Id613234380
""".strip()
def process_lines(s: str):
for line in s.split('\n'):
sql_code = ''.join(take_nums(line.split('SQLCODE: ', 1)[-1]))
if len(sql_code) == 4:
yield sql_code
else:
sql_state = ''.join(take_nums(line.split('SQLSTATE: ', 1)[-1][1:]))
yield sql_state
def take_nums(s: str):
"""take from string only while we get space or numeric chars"""
for c in s:
if c.isnumeric():
yield c
elif not c.isspace():
break
for i, line in enumerate(process_lines(test_string), 1):
print(f'{i}. {line!r}')
Result:
1. '3535'
2. '2631'
3. '2631'
4. '7547'