regex to capture part of the string that contains tabs/spaces and substring from text file

Time:04-15

I have a text file that has the following info:

"Where_can_i_find        red    capture    state"
"Why_are_you        orange    00:AO    state"
"Salty_pepper        gray    good    state"

import re

with open(cur_path, 'r') as file:
    data = file.read()
    itm1 = re.search('Where_can_i_find        (.+?)state', data).group(1)
    itm2 = re.search('Salty_pepper        (.+?)state', data).group(1)

This gives me red capture and gray good, etc. But I only want to get capture for the first item and good for the second item, without the red and gray parts. In other words, I want to skip everything in the 2nd column.

How should I change my regex for this to work?

CodePudding user response:

You can use

with open(cur_path, 'r') as file:
    for line in file:
        if line.strip().endswith('state') and any(line.strip().startswith(x) for x in ['Where_can_i_find', 'Salty_pepper']):
            print(line.split()[-2])


Notes:

  • line.strip().endswith('state') - checks if the line ends with state
  • any(line.strip().startswith(x) for x in ['Where_can_i_find','Salty_pepper']) - checks if the line starts with one of the specified strings.
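
The two checks above can be put together into a runnable sketch; the file is replaced here with an in-memory sample (contents taken from the question, with the surrounding quotes omitted):

```python
# Runnable sketch of the line-filtering approach above; the open file
# is replaced with an in-memory sample mirroring the question's data.
sample = '''Where_can_i_find        red    capture    state
Why_are_you        orange    00:AO    state
Salty_pepper        gray    good    state'''

results = []
for line in sample.splitlines():
    if line.strip().endswith('state') and any(
        line.strip().startswith(x) for x in ['Where_can_i_find', 'Salty_pepper']
    ):
        # the wanted value is always the second-to-last whitespace-separated column
        results.append(line.split()[-2])

print(results)  # ['capture', 'good']
```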

CodePudding user response:

The \s (any whitespace) and \S (non-whitespace) classes are useful here.

To match a single non-whitespace sequence \S+, separated by a whitespace sequence \s+ right before state:

re.search(r'Where_can_i_find.*?(\S+)\s+state', data).group(1)

re.search(r'Salty_pepper.*?(\S+)\s+state', data).group(1)
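
A quick self-contained check of those two searches, using \S+ and \s+ as described (the data string below mirrors the question's file contents, quotes omitted):

```python
import re

# Sample data mirroring the question's file (quotes omitted).
data = '''Where_can_i_find        red    capture    state
Why_are_you        orange    00:AO    state
Salty_pepper        gray    good    state'''

# Lazily skip the middle column, then capture the last token before "state".
itm1 = re.search(r'Where_can_i_find.*?(\S+)\s+state', data).group(1)
itm2 = re.search(r'Salty_pepper.*?(\S+)\s+state', data).group(1)
print(itm1, itm2)  # capture good
```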

Since you mention 'columns', another approach would be to split the whole thing into columns first, and then select the right items. For instance:

data = '''Where_can_i_find        red    capture    state
Why_are_you        orange    00:AO    state
Salty_pepper        gray    good    state'''

data_split = [line.split() for line in data.splitlines()]

data_dict = {line[0]: line[2] for line in data_split}

>>> data_dict
{'Where_can_i_find': 'capture',
 'Why_are_you': '00:AO',
 'Salty_pepper': 'good'}

Since this avoids regexes, it can be a lot faster (perhaps depending on how many of the lines you actually want to access).
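
Once the columns are in a dict, individual lookups are constant-time, and dict.get gives a safe default for keys that are not present:

```python
data = '''Where_can_i_find        red    capture    state
Why_are_you        orange    00:AO    state
Salty_pepper        gray    good    state'''

# Same split-into-columns idea as above: {first column: third column}.
data_dict = {cols[0]: cols[2] for cols in (line.split() for line in data.splitlines())}

print(data_dict['Where_can_i_find'])           # capture
print(data_dict.get('Missing_key', 'absent'))  # absent
```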

CodePudding user response:

If all your lines end with state, then you can use: (\S+)(?=\s+state)
Test here: https://regex101.com/r/HQVXRe/1

Since you want to get the column just before state in specific lines, you can use startswith to find those lines and then apply the regex.

import re

s = '''"Where_can_i_find        red    capture    state"
"Why_are_you        orange    00:AO    state"
"Salty_pepper        gray    good    state"'''
lines = s.split('\n')

pattern = re.compile(r'(\S+)(?=\s+state)')
prefixes = ('"Where_can_i_find', '"Salty_pepper')

for line in lines:
    if line.startswith(prefixes):
        print(pattern.findall(line))
# ['capture']
# ['good']