I am trying to write a regular expression to select a range (paragraph) of text. The regex I have come up with so far matches too much, and I can't figure out how to fix it. The regex I am using is:
SALINAS. [\s\S] (^---LETTUCE) ?
I am trying to pull out:
SALINAS-WATSONVILLE CALIFORNIA
Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis
VEGETABLES
2022 Season
---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER. Extra services
included. Wide range in quality and condition. cartons bchd 14s 9.55-13.55
mostly 10.00-11.75 few 14.45-14.95 occasional lower bchd 18s 10.05-14.05 mostly
10.50-12.25 few 14.95-15.45 occasional lower 20 lb cartons loose Crown Cut
10.00-14.85 mostly 11.00-12.75 few 15.50-15.95 Short Trim 12.00-15.85 mostly
12.00-13.75 few 16.50-16.95 ORGANIC cartons bchd 14s 14.00-22.50 mostly
16.55-18.75 few 24.95 occasional higher 20 lb cartons loose Crown Cut
18.00-24.75 mostly 20.55-22.75 few 28.50-28.95
---CAULIFLOWER: SUPPLY FAIRLY HEAVY. DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY.
Extra services included. Harvest curtailed by market conditions. cartons film
wrapped White 9s 8.55-11.75 mostly 8.65-10.00 few 12.55 occasional higher and
lower 12s 8.45-12.75 mostly 9.45-10.65 few 13.50-13.55 occasional higher 16s
7.55-11.55 mostly 8.45-9.65 few 12.55 ORGANIC cartons film wrapped White 9s
9.00-16.50 mostly 13.55-15.50 one label 18.95 12s 12.00-17.50 mostly
14.50-15.85 few 18.95 16s 12.00-16.50 mostly 12.00-14.50 one label 18.95
---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY. Extra services included. Wide
range in quality and condition. cartons 2 dz 12.00-16.75 mostly 13.35-15.00 few
17.50-18.55 occasional lower 2 1/2 dz 12.00-16.75 mostly 14.06-15.50 few
17.50-18.55 occasional lower 3 dz 14.06-17.55 mostly 14.06-16.65 one label
18.45 cartons film bags Hearts 18s 17.06-20.55 mostly 17.50-19.06 few
21.55-22.55 ORGANIC cartons 2 dz 14.00-17.50 mostly 14.50-16.75 few 18.50 one
label 20.95 2 1/2 dz 14.50-17.50 mostly 14.50-16.75 few 18.50-18.56 one label
20.95 cartons film bags Hearts 18s 14.50-18.95 mostly 16.00-17.85 occasional
higher
---LETTUCE-ICEBERG: DEMAND FAIRLY GOOD. MARKET SLIGHTLY HIGHER. Extra services
included. Wide range in quality and condition. cartons flm lined 24s
15.55-18.75 mostly 15.55-17.50 few 19.00-19.65 few 12.00-12.95 24s flmwrpd
16.55-19.75 mostly 16.55-18.50 few 20.00-20.65 few 13.00-13.95 30s flmwrpd
14.00-15.55 mostly 14.00-14.75 occasional higher ORGANIC cartons 24s flmwrpd
16.00-20.50 mostly 16.00-18.50 12s flmwrpd 10.00-12.95 mostly 10.00-11.85
---LETTUCE-OTHER: DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. Extra services
included. Wide range in quality and condition. cartons Boston 24s 10.50-14.75
mostly 10.50-12.55 few 15.50-15.75 Green Leaf 24s 8.56-11.95 mostly 9.50-10.65
few 12.06-12.50 one label 15.75 Red Leaf 24s 8.56-11.75 mostly 9.50-10.65 few
12.05-12.75 ORGANIC cartons Green Leaf 24s 12.00-16.75 mostly 12.00-14.75 Red
Leaf 24s 12.00-16.75 mostly 12.00-14.75
---LETTUCE-ROMAINE: DEMAND HEARTS FAIRLY LIGHT, 24S MODERATE. MARKET ABOUT
STEADY. Extra services included. Wide range in quality and condition. cartons
24s 10.00-13.75 mostly 10.15-11.95 occasional higher cartons 12 3-count
packages Hearts 12.85-17.95 mostly 14.50-16.75 few 18.05-18.75 occasional
higher cartons film lined Hearts 48s 13.85-19.95 mostly 16.50-18.75 few
20.00-20.75 ORGANIC cartons 24s 16.50-18.55 few 20.00-20.75 cartons 12 3-count
packages Hearts 13.85-20.50 mostly 17.55-18.95 few 22.50-22.75
from the text file located here: https://www.ams.usda.gov/mnreports/ix_fv120.txt
CodePudding user response:
If you put ? after a quantifier such as .*, it becomes lazy (instead of greedy), so it matches as little text as possible. The regex below (tested on Regex101) matches just the paragraph. You can modify it to match the market report for any other city.
SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?\n\n (flags: gm)
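To illustrate the lazy versus greedy difference in Python, here is a minimal sketch. The sample text is a made-up miniature of the report (the second section name is only a placeholder), and [\s\S] is used as an equivalent of (.|\n) for "any character including a newline".

```python
import re

# Hypothetical miniature of the report: two sections, each followed by a blank line.
text = (
    "SALINAS-WATSONVILLE CALIFORNIA\n"
    "---BROCCOLI: MARKET SLIGHTLY LOWER.\n"
    "\n"
    "OXNARD DISTRICT CALIFORNIA\n"
    "---CELERY: MARKET ABOUT STEADY.\n"
    "\n"
)

# Greedy: [\s\S]* grabs as much as it can, so the match runs to the LAST blank line.
greedy = re.search(r"SALINAS[\s\S]*\n\n", text).group(0)

# Lazy: [\s\S]*? grabs as little as it can, so the match stops at the FIRST blank line.
lazy = re.search(r"SALINAS[\s\S]*?\n\n", text).group(0)

print("OXNARD" in greedy)  # the greedy match spills into the next section
print("OXNARD" in lazy)    # the lazy match stays inside the first section
```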
Just change the name (SALINAS-WATSONVILLE CALIFORNIA) to select a different city's market. Note that the match includes the two newlines at the end (and also a large amount of trailing whitespace). To drop the newlines, wrap everything before the final \n\n in a group and use just that group (group 1). See the Regex101 link.
(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n (flags: gm)
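If you are working in Python, the grouped regex could be applied roughly as follows. This is only a sketch: extract_section and the sample text are hypothetical names and data for illustration, the inner groups are made non-capturing so that group 1 is the whole section, and the g/m flags from Regex101 are not needed because re.search returns only the first match and the pattern has no ^ or $ anchors.

```python
import re
from typing import Optional


def extract_section(report: str, header: str) -> Optional[str]:
    """Return the section starting at `header`, without the trailing blank line."""
    # re.escape keeps characters such as '-' in the header from being read as regex syntax.
    pattern = re.compile(
        "(" + re.escape(header) + r"(?:.|\n)*?MARKET(?:.|\n)*?)\n\n"
    )
    match = pattern.search(report)
    return match.group(1) if match else None


# Hypothetical miniature of the report, with sections separated by a blank line.
sample = (
    "SALINAS-WATSONVILLE CALIFORNIA\n"
    "Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis\n"
    "---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER.\n"
    "---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY.\n"
    "\n"
    "OXNARD DISTRICT CALIFORNIA\n"
    "---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY.\n"
    "\n"
)

section = extract_section(sample, "SALINAS-WATSONVILLE CALIFORNIA")
print(section)
```

On the real file you would read the whole text first, for example with open('ix_fv120.txt').read(), and pass it in as report.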
CodePudding user response:
Regular expressions are great for matching patterns, but since you are probably using Python (I inferred that from the title) and dealing with multiple lines at once, I would go for an easier approach.
You may try the following code:
# This is the location for which you want to extract the paragraph
location_to_find = 'SALINAS'
# Read all lines into a list. Each line ends with a '\n'
with open('ix_fv120.txt') as fr:
    lines = fr.readlines()
# Look for all occurrences of "Sales F.O.B." and get their location (i.e. line number). Because
# paragraphs start 2 lines before each occurrence of "Sales F.O.B.", we collect values of "n-2".
paragraph_poss = [(n-2) for n, line in enumerate(lines) if line.startswith('Sales F.O.B.')]
# Now search only in lines having a location name to see which one of them is for the location
# you are looking for, e.g. "SALINAS".
for n, cur_paragraph_line in enumerate(paragraph_poss):
    if location_to_find in lines[cur_paragraph_line]:
        if n == len(paragraph_poss)-1:
            # Last paragraph in the file: take everything up to the end
            paragraph = ''.join(lines[paragraph_poss[n]:])
        else:
            # Otherwise stop where the next paragraph starts
            paragraph = ''.join(lines[paragraph_poss[n]:paragraph_poss[n+1]])
        print(paragraph)
        break
else:
    # This "else" belongs to the "for" loop: it runs only if the loop ended without "break"
    print('Error: Could not find "' + location_to_find + '" at the beginning of any paragraph.')