Python regex text extraction

The text input looks something like this: West Team 4, Eastern 3\n

-------Update--------

The input is a text file containing team names and scores, as in a football game. The whole file looks something like this, with two names and two scores per line:

West Team 4, Eastern 5
Nott Team 2, Eastern 3
West wood 1, Eathan 2
West Team 4, Eas 5

I am using with open to read the file line by line, so there will be a \n at the end of each line.

I would like to extract each line of text into something like:

['West Team', 'Eastern']

What I currently have in mind is to use regex

result = re.sub("[\n^\s$\d]", "", text).split(",")

this code results in this:

['WestTeam','Eastern']

I'm sure that my regex is not correct. I want to remove '\n' and any number, including the space in front of the number, but not the space in the middle of the name.

I am open to any suggestion that achieves this result; it doesn't necessarily have to use regex.

CodePudding user response：

You can use a non-regex approach to keep any letters/spaces after splitting with a comma:

text = "West Team 4, Eastern 3\n"
print( ["".join(c for c in x if c.isalpha() or c.isspace()).strip() for x in text.split(',')]  )
# => ['West Team', 'Eastern']

Or a regex approach that removes any chars other than ASCII letters and whitespace, matched with the [^a-zA-Z\s]+ pattern:

import re
rx = re.compile(r'[^a-zA-Z\s]+')
print( [rx.sub("", x).strip() for x in text.split(',')]  )
# => ['West Team', 'Eastern']


In case there are consecutive non-letter chunks you can use

import re
text = "West Team 4, Eastern 3\n, test 23 99 test"
rx = re.compile(r'[^\W\d_]+')
print( [" ".join(rx.findall(x)) for x in text.split(',')]  )

This yields ['West Team', 'Eastern', 'test test']. The [^\W\d_]+ pattern matches one or more Unicode letters.
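
To illustrate how the ASCII-only class and the Unicode-letter class differ, here is a small sketch. The accented team names are made-up sample data, not taken from the question:

```python
import re

# Hypothetical sample line with non-ASCII letters (not from the original question).
text = "Malmö FF 2, Bayern München 3\n"

ascii_rx = re.compile(r'[^a-zA-Z\s]+')   # removes everything except ASCII letters and whitespace
unicode_rx = re.compile(r'[^\W\d_]+')    # matches runs of Unicode letters

ascii_result = [ascii_rx.sub("", x).strip() for x in text.split(',')]
unicode_result = [" ".join(unicode_rx.findall(x)) for x in text.split(',')]

print(ascii_result)    # ['Malm FF', 'Bayern Mnchen']
print(unicode_result)  # ['Malmö FF', 'Bayern München']
```

If the team names can contain accented letters, the second pattern is probably the safer choice.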

CodePudding user response：

Actually re.findall might work well here:

import re

inp = "West Team 4, Eastern 3\n"
matches = re.findall(r'(\w+(?: \w+)*) \d+', inp)
print(matches)  # ['West Team', 'Eastern']

The split version, using re.split:

inp = "West Team 4, Eastern 3\n"
matches = [x for x in re.split(r'\s+\d+\s*,?\s*', inp) if x != '']
print(matches)  # ['West Team', 'Eastern']

CodePudding user response：

import re

text = 'West Team 4, Eastern 3\n'

# Remove digits and newlines; a raw string avoids invalid-escape warnings.
result = re.sub(r"[\n\d]", "", text).split(",")

# Remove the leading and trailing spaces:
result = [x.strip() for x in result]
print(result)
# result: ['West Team', 'Eastern']

CodePudding user response：

You want to:

remove '\n' and
any number including the space in front of the number
but not the space in the middle of the name.

Functions to use:

for constant parts you could just replace using str.replace().
for all dynamic matches we need a regex to substitute with empty-string using re.sub().
for surroundings we can even use str.strip() to remove leading and trailing whitespaces like \n.

Code

import re

text = "West Team 4, Eastern 3\n"  # named text to avoid shadowing the built-in input()

cleaned = re.sub(r'\s+\d+', '', text)  # remove numbers with leading spaces
cleaned = cleaned.strip()  # remove surrounding whitespace like \n
print(cleaned)

output = cleaned.split(", ")  # split on comma plus space so no leading space remains
print(output)

Prints:

West Team, Eastern
['West Team', 'Eastern']
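
The same cleanup can be applied to each line of the file described in the question. The following is only a sketch: the helper name team_names is hypothetical, and io.StringIO stands in for the real file so the example is self-contained.

```python
import io
import re

def team_names(line):
    """Return the team names from one 'Name score, Name score' line."""
    cleaned = re.sub(r'\s+\d+', '', line).strip()  # drop scores and the trailing newline
    return [name.strip() for name in cleaned.split(',')]

# io.StringIO stands in for a real file here; with a file on disk one could
# use `with open("scores.txt") as f:` instead ("scores.txt" is a hypothetical name).
sample = io.StringIO(
    "West Team 4, Eastern 5\n"
    "Nott Team 2, Eastern 3\n"
    "West wood 1, Eathan 2\n"
    "West Team 4, Eas 5\n"
)

# Skip blank lines, then extract the names from every remaining line.
all_names = [team_names(line) for line in sample if line.strip()]
print(all_names)
```

Iterating over the file object yields one line at a time, including the trailing \n, which strip() removes.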

CodePudding user response：

You can remove the digits and collapse any resulting runs of two or more spaces into a single space.

Then split on commas, drop empty values, and trim each item:

import re

s = "West Team 4 , Eastern 3, test 23 99 test\n,"

res = [
    m.strip() for m in re.sub(r"[^\S\n]{2,}", " ", re.sub(r"\d+", "", s)).split(",") if m
]
print(res)

Output

['West Team', 'Eastern', 'test test']


CodePudding user response：

You haven't clearly defined the rules for getting the required output from your sample input. However, this will give what you've asked for but may not cover all eventualities:

in_string = 'West Team 4, Eastern 3\n'

result = [' '.join(t.split()[:-1]) for t in in_string.split(',')]

print(result)

Output:

['West Team', 'Eastern']
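
One eventuality this does not cover is an entry without a trailing score, where the last word of the name would be dropped instead. A minimal sketch of a more defensive variant, assuming the score is always a run of digits; the helper name strip_score and the sample line are hypothetical:

```python
def strip_score(entry):
    """Drop the last whitespace-separated token only if it is all digits."""
    tokens = entry.split()
    if tokens and tokens[-1].isdigit():
        tokens = tokens[:-1]
    return ' '.join(tokens)

in_string = 'West Team 4, Eastern\n'  # made-up line: the second entry has no score

naive = [' '.join(t.split()[:-1]) for t in in_string.split(',')]
guarded = [strip_score(t) for t in in_string.split(',')]

print(naive)    # ['West Team', '']
print(guarded)  # ['West Team', 'Eastern']
```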

CodePudding user response：

So many ways this can be done, but looking at your data you could use rstrip() quite nicely:

s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip('\n 0123456789') for x in s.split(', ')]
print(lst)

Or maybe rather use:

from string import digits
s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip(digits + '\n ') for x in s.split(', ')]
print(lst)

Both options print:

['West Team', 'Eastern']
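
Note that rstrip() treats its argument as a set of characters rather than a suffix, so a name that itself ends in digits would be trimmed as well. A small sketch with a made-up line showing this, next to a regex alternative that removes only the final number:

```python
import re
from string import digits

s = 'Schalke 04 2, Eastern 3\n'  # made-up line: the first name ends in digits

# rstrip keeps removing trailing characters as long as they are in the given set.
stripped = [x.rstrip(digits + '\n ') for x in s.split(', ')]
print(stripped)  # ['Schalke', 'Eastern']

# Remove only the last run of digits (and surrounding whitespace) at the end of each entry.
last_only = [re.sub(r'\s*\d+\s*$', '', x) for x in s.split(', ')]
print(last_only)  # ['Schalke 04', 'Eastern']
```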