I've never used regular expressions but know a little bit of the syntax after reading a bit on the python website.
I'm trying to write a regular expression that matches entries in a text document. Each entry is on its own line.
For example here's a portion of the document I'm searching:
ABSOLVE AH0 B Z AA1 L V
ABSOLVE(1) AE0 B Z AA1 L V
ABSOLVED AH0 B Z AA1 L V D
ABSOLVED(1) AE0 B Z AA1 L V D
ABSOLVES AH0 B Z AA1 L V Z
ABSOLVES(1) AE0 B Z AA1 L V Z
If I had the string 'ABSOLVE' that I'm searching the document for, I want to find 'ABSOLVE' and 'ABSOLVE(1)', but NOT 'ABSOLVED' or anything else listed. Each line in this text document has a word followed by 2 spaces, then some other data that should be collected after a match is found.
I'm just not sure if I should use the $ or \b anchors in a regex, or something else. My idea was to use something like
re.sub(r'([A-Z])(.*?)( )', r'\1\3', text)
Then just search from there. But this seems really inefficient considering I have a huge file (>100K lines). Is there a better way to use regex in this case? Would I want to use re.sub() to strip the parentheses, or re.search()? Is there an expression I can use to collect the data after a match until it hits a newline? I'd really appreciate any help here and some advice on using regular expressions. Thanks.
CodePudding user response:
- There is no need to run a regex over the whole file as one big string; process the file line by line, so each match is confined to a single entry.
- Since you're searching for the text ABSOLVE, followed by a word boundary, optionally followed by a number in parentheses, you should include the text ABSOLVE in your regex.
- If all you want is to extract the lines that start with (or include) this pattern, you should only include this pattern in your regex and test for that. Then, if the line contains the pattern, you can append it to a list.
- You need to escape parentheses because they are used in regex to indicate a capturing group.
- Compile the regex once, before the loop, and reuse the compiled object for every line.
Here's a regex you could use:
ABSOLVE\b(\(\d+\))?
Explanation:
------------
ABSOLVE     The word ABSOLVE
\b          A word boundary
\(\d+\)     One or more digits enclosed in parentheses
(     )?    The previous group, zero or one times
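As a quick sanity check, the pattern can be tried against a few of the sample lines from the question before running it on the real file. This is only a sketch; the `samples` list is made up from the lines quoted above.

```python
import re

# Same pattern as explained above: the word, a word boundary, optional "(digits)".
pattern = re.compile(r"ABSOLVE\b(\(\d+\))?")

# Illustrative sample lines copied from the question.
samples = [
    "ABSOLVE AH0 B Z AA1 L V",
    "ABSOLVE(1) AE0 B Z AA1 L V",
    "ABSOLVED AH0 B Z AA1 L V D",
    "ABSOLVES(1) AE0 B Z AA1 L V Z",
]

# Keep only the lines where the pattern is found.
matched = [line for line in samples if pattern.search(line)]
print(matched)
```

ABSOLVED and ABSOLVES(1) are rejected because there is no word boundary between the E and the following letter.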
import re

search_regex = re.compile(r"ABSOLVE\b(\(\d+\))?")  # compiled once, reused for every line
matching_lines = []
with open("fname.txt") as file:
    for line in file:
        if search_regex.search(line):  # pattern found somewhere in this line
            matching_lines.append(line)
Which gives
matching_lines = ['ABSOLVE AH0 B Z AA1 L V\n', 'ABSOLVE(1) AE0 B Z AA1 L V\n']
As a list comprehension:
with open("fname.txt") as file:
    matching_lines = [line for line in file if search_regex.search(line)]
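To also collect the data that follows the word, as the question asks, one option is to anchor the pattern at the start of the line and capture the rest of the line in a group. The following is a minimal sketch, assuming the layout described in the question (word, optional `(n)` suffix, whitespace, then data). `build_entry_regex` and `find_entries` are hypothetical helper names, not part of the `re` module.

```python
import re

def build_entry_regex(word):
    """Compile a pattern for `word` at the start of a line, with an optional (n) suffix."""
    # re.escape protects against regex metacharacters in the search word.
    # (?!\S) requires whitespace or end of line next, so ABSOLVED is rejected.
    # (.*) captures everything after the separating whitespace.
    return re.compile(r"^" + re.escape(word) + r"(?:\(\d+\))?(?!\S)\s*(.*)$")

def find_entries(lines, word):
    """Return the data part of every line whose first word is `word` or `word(n)`."""
    regex = build_entry_regex(word)
    results = []
    for line in lines:
        match = regex.match(line.rstrip("\n"))
        if match:
            results.append(match.group(1))
    return results

# Illustrative sample lines in the format described in the question.
sample = [
    "ABSOLVE  AH0 B Z AA1 L V\n",
    "ABSOLVE(1)  AE0 B Z AA1 L V\n",
    "ABSOLVED  AH0 B Z AA1 L V D\n",
]
print(find_entries(sample, "ABSOLVE"))
```

Because `find_entries` accepts any iterable of lines, an open file object can be passed in directly and the file is still read one line at a time.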
CodePudding user response:
I'm assuming your requirements are:
- Find a line that starts with a full word (e.g. "SOLVE" would not match the line with "ABSOLVE")
- The words are delimited by a non-letter character, but '-' are part of a word (thus POST-MODERNISM is a valid entry)
- You are just interested in finding the line (what you call entry)
- You are searching the same list several times
Because of the last requirement, it is more useful to pre-process the input once and then look words up directly. Here is a possible solution:
import re
from collections import defaultdict

def pre_process_input(filename):
    """Returns a dictionary using the first word in each line as key and a list of the entries as the result"""
    pattern = re.compile(r'^([\w-]+)')  # one or more word characters (\w) or '-'
    result = defaultdict(list)
    with open(filename, 'r') as fh:
        for line in fh:
            match = pattern.match(line)
            if match is None:  # skip blank lines or lines that do not start with a word
                continue
            key = match.group(1)  # .group(1) gets the first parenthesised group
            result[key].append(line.strip())  # Adds the full line to that entry in the dictionary
    return result

# Example of use
data = pre_process_input('my_words.txt')
# Get all lines with ABSOLVE (note that this is case sensitive: it will not match 'absolve')
for entry in data['ABSOLVE']:
    print(entry)
If you want the search not to be case sensitive, just normalise key to be always lowercase and search the dict with lowercase only.
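A case-insensitive variant might look like the sketch below. It takes an iterable of lines rather than a filename so it can be tried without a file on disk; `build_index` and `lookup` are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict

def build_index(lines):
    """Map the lowercased first word of each line to the list of matching lines."""
    pattern = re.compile(r"^([\w-]+)")
    index = defaultdict(list)
    for line in lines:
        match = pattern.match(line)
        if match:  # skip blank lines or lines that do not start with a word
            index[match.group(1).lower()].append(line.strip())
    return index

def lookup(index, word):
    """Look up a word regardless of case; returns an empty list if it is absent."""
    # .get avoids creating empty entries in the defaultdict on a miss.
    return index.get(word.lower(), [])

# Illustrative sample lines copied from the question.
sample = [
    "ABSOLVE AH0 B Z AA1 L V\n",
    "ABSOLVE(1) AE0 B Z AA1 L V\n",
    "ABSOLVED AH0 B Z AA1 L V D\n",
]
index = build_index(sample)
print(lookup(index, "Absolve"))
```

The `(1)` suffix is not part of the key because `(` is outside the `[\w-]` character class, so both pronunciations of a word end up under the same entry.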