Home > database >  regex conditional extract in python-re
regex conditional extract in python-re

Time:05-31

I am trying to extract the SQLCODE which typically is 4 integers from error text like below :

1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id           629709103
3. SQLSTATE: 40001, SQLCODE: 2631                Session Id  594700603
4. SQLSTATE: T7547, SQLCODE:   754Session Id613234380

I have below pattern matching regex for the same currently. However, there is aedge case like 4 which is failing.

error_cd = re.findall(r'SQLCODE:\s([^.,\s] )', err_log)

If there are no 4 integers after SQLCODE, i would like to extract the SQLSTATE text after the letter 'T'

Expected Output :

1. 3535
2. 2631
3. 2631
4. 7547

Any suggestions on how to achieve this is appreciated. Thank you.

CodePudding user response:

This could probably be done purely in regex, using a lookahead/lookbehind to handle the conditional extraction, but that could get really messy.

SOLUTION 1: Pure Regex:

EDIT: Here's the pure regex solution... simpler than I thought it would be (though definitely more error-prone than the mixed approach... would need some additional logic to make it more robust):

re.findall(r'((?:(?<=SQLSTATE: T)(?![0-9]{4}, SQLCODE: [0-9]{4})[0-9]{4})|(?:(?<=SQLCODE: )[0-9]{4}))', err_log)

SOLUTION 2: Regex and Python:

The following solution uses regex to pull both the SQLSTATE and SQLCODE values, and uses a list comprehension to do the conditional extraction:

err_log = '''
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id           629709103
3. SQLSTATE: 40001, SQLCODE: 2631                Session Id  594700603
4. SQLSTATE: T7547, SQLCODE:   754Session Id613234380
'''

error_st_cd = re.findall(r'SQLSTATE:  T([0-9] ), SQLCODE:  ([0-9]{4})?', err_log)
error_cd = [codes[1] or codes[0] for codes in error_st_cd]

for i, cd in enumerate(error_cd):
    print(f'{i 1}. {cd}')

Output:

  1. 3535
  2. 2631
  3. 2631
  4. 7547

CodePudding user response:

A regex approach might indeed be easier, but in any case, here's an approach w/o use of regex:

test_string = """
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id           629709103
3. SQLSTATE: 40001, SQLCODE: 2631                Session Id  594700603
4. SQLSTATE: T7547, SQLCODE:   754Session Id613234380
""".strip()


def process_lines(s: str):
    for line in s.split('\n'):
        sql_code = ''.join(take_nums(line.split('SQLCODE: ', 1)[-1]))
        if len(sql_code) == 4:
            yield sql_code
        else:
            sql_state = ''.join(take_nums(line.split('SQLSTATE: ', 1)[-1][1:]))
            yield sql_state


def take_nums(s: str):
    """take from string only while we get space or numeric chars"""
    for c in s:
        if c.isnumeric():
            yield c
        elif not c.isspace():
            break


for i, line in enumerate(process_lines(test_string), 1):
    print(f'{i}. {line!r}')

Result:

1. '3535'
2. '2631'
3. '2631'
4. '7547'
  • Related