Python - Extracting only necessary elements from a string

Time:09-14

I'm trying to extract only the parts I need from the table.

    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    2564    texttext    0   30  30  0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
    2567    texttext    10  70  80  0   0   0   0   lowness 0

All I need is the 'texttext' column, the two numbers that immediately follow it, and the 'lowness' column, as shown below.

    texttext    0   100 lowness
    texttext    10  650 lowness
    texttext    0   30  lowness
    texttext    0   0   lowness
    texttext    10  70  lowness

I tried this but failed.

text = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    2564    texttext    0   30  30  0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
    2567    texttext    10  70  80  0   0   0   0   lowness 0
"""

for a in text.split('\n'):
    if a == "":
        continue
    else:
        print(a)
        m = re.match('(^\D\d*\D)(\w*\s)(\d*\s)(\d*\s)(\d*\s\d*\s\d*\s\d*\s\d*\s)(\w+)', a)
        print(m)
        print(m.group(2), m.group(3), m.group(4), m.group(6))

I tried to capture the parts with regex groups, but I got the following error. Can anyone help?

    print(m.group(2), m.group(3), m.group(4), m.group(6))
    AttributeError: 'NoneType' object has no attribute 'group'

CodePudding user response:

If you absolutely want to use a regular expression:

import re

text = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    2564    texttext    0   30  30  0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
    2567    texttext    10  70  80  0   0   0   0   lowness 0
"""
pattern = re.compile(
    r"\s*\d+\s+(\w+)\s+(\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\w+)\s+"
)

for line in text.strip().split('\n'):
    match = re.search(pattern, line)
    print(*match.groups())

Output:

texttext 0 100 lowness
texttext 10 650 lowness
texttext 0 30 lowness
texttext 0 0 lowness
texttext 10 70 lowness

However, if every line always has the same number of whitespace-separated fields, you are probably better off just splitting each line on whitespace:

for line in text.strip().split('\n'):
    parts = line.split()
    print(parts[1], parts[2], parts[3], parts[9])

Same output.
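As a minimal sketch of the splitting approach, the loop can be wrapped in a small helper that also skips blank or short lines. The function name extract_fields, the indexes parameter, and the field-count check are assumptions made for illustration; they are not part of the original answer.

```python
def extract_fields(text, indexes=(1, 2, 3, 9)):
    """Return the selected whitespace-separated fields of each usable line."""
    rows = []
    for line in text.splitlines():
        # split() with no argument collapses runs of whitespace
        parts = line.split()
        if len(parts) <= max(indexes):
            continue  # skip blank lines and lines with too few fields
        rows.append([parts[i] for i in indexes])
    return rows


sample = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
"""

for row in extract_fields(sample):
    print(*row)
```

The length check means a malformed line is silently skipped instead of raising an IndexError; whether that is the right behavior depends on how strict the input format is.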

CodePudding user response:

You are not getting a match because \D and \s without a quantifier each match exactly one character.

In the example data, however, the fields are separated by runs of several whitespace characters, so the pattern fails before it can reach the next group.
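To make the difference concrete, here is a small sketch that compares single-character tokens with quantified ones on one sample line. The two short patterns and the variable names are illustrative only; they are simplified and are not the patterns from the question.

```python
import re

line = "    2555    texttext    0   100 100 0   0   0   0   lowness 0"

# Without a quantifier, \s matches exactly one whitespace character,
# so the second leading space already makes \d+ fail.
single = re.match(r"\s\d+\s\w+", line)

# With +, each token may repeat, so runs of spaces are consumed.
repeated = re.match(r"\s+\d+\s+\w+", line)

print(single)  # None, because re.match returns None when nothing matches
if repeated is not None:  # guard before calling .group()
    print(repeated.group().split())
```

Checking the result for None before calling .group() is what avoids the AttributeError from the question.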

If you fix that, you will get a match but with the wrong data in the groups, see https://regex101.com/r/v3ddai/1


Instead, you can just use 2 capture groups.

As there always seem to be digits present, you can change \d* to \d+

^\s*\d+\s+(\w+\s+\d+\s+\d+\s+)\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\w+)
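A possible way to use a two-group pattern of this shape from Python is sketched below. It assumes the quantifiers are + throughout, and the surrounding loop and the rows list are added here for illustration.

```python
import re

# Group 1: the word and the two numbers after it (with their spacing).
# Group 2: the word that follows the five skipped numeric fields.
pattern = re.compile(
    r"^\s*\d+\s+(\w+\s+\d+\s+\d+\s+)\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\w+)"
)

text = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
"""

rows = []
for line in text.strip().split("\n"):
    m = pattern.match(line)
    if m is None:
        continue  # the line does not fit the expected layout
    # split() normalizes the whitespace captured inside group 1
    rows.append(m.group(1).split() + [m.group(2)])

for row in rows:
    print(*row)
```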


CodePudding user response:

Try this:

for a in text.split('\n'):
    if a == "":
        continue
    else:
        parts = a.split()
        print(parts[1],parts[2],parts[3],parts[9])

CodePudding user response:

for e in text.splitlines():
    if e:
        ls = e.split()
        print(ls[1:4] + ls[-2:-1])

['texttext', '0', '100', 'lowness']
['texttext', '10', '650', 'lowness']
['texttext', '0', '30', 'lowness']
['texttext', '0', '0', 'lowness']
['texttext', '10', '70', 'lowness']