Python Regex Help - Greedy

Time:07-23

I am trying to write a regular expression that selects a range (one paragraph) of text. The regex I have come up with so far matches too much, and I can't figure out how to fix it. The regex I am using is:

SALINAS.+[\s\S]+(^---LETTUCE)+?

I am trying to pull out

SALINAS-WATSONVILLE CALIFORNIA                                                 
                                                                               
Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis 

VEGETABLES                                                                     
2022 Season                                                                    
---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER. Extra services        
included. Wide range in quality and condition. cartons bchd 14s 9.55-13.55     
mostly 10.00-11.75 few 14.45-14.95 occasional lower bchd 18s 10.05-14.05 mostly
10.50-12.25 few 14.95-15.45 occasional lower 20 lb cartons loose Crown Cut     
10.00-14.85 mostly 11.00-12.75 few 15.50-15.95 Short Trim 12.00-15.85 mostly   
12.00-13.75 few 16.50-16.95 ORGANIC cartons bchd 14s 14.00-22.50 mostly        
16.55-18.75 few 24.95 occasional higher 20 lb cartons loose Crown Cut          
18.00-24.75 mostly 20.55-22.75 few 28.50-28.95                                 
---CAULIFLOWER: SUPPLY FAIRLY HEAVY. DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. 
Extra services included. Harvest curtailed by market conditions. cartons film  
wrapped White   9s 8.55-11.75 mostly 8.65-10.00 few 12.55 occasional higher and
lower 12s 8.45-12.75 mostly 9.45-10.65 few 13.50-13.55 occasional higher 16s   
7.55-11.55 mostly 8.45-9.65 few 12.55 ORGANIC cartons film wrapped White 9s    
9.00-16.50 mostly 13.55-15.50 one label 18.95 12s 12.00-17.50 mostly           
14.50-15.85 few 18.95 16s 12.00-16.50 mostly 12.00-14.50 one label 18.95       
---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY. Extra services included. Wide 
range in quality and condition. cartons 2 dz 12.00-16.75 mostly 13.35-15.00 few
17.50-18.55 occasional lower 2 1/2 dz 12.00-16.75 mostly 14.06-15.50 few       
17.50-18.55 occasional lower 3 dz 14.06-17.55 mostly 14.06-16.65 one label     
18.45 cartons film bags Hearts 18s 17.06-20.55 mostly 17.50-19.06 few          
21.55-22.55 ORGANIC cartons 2 dz 14.00-17.50 mostly 14.50-16.75 few 18.50 one  
label 20.95 2 1/2 dz 14.50-17.50 mostly 14.50-16.75 few 18.50-18.56 one label  
20.95 cartons film bags Hearts 18s 14.50-18.95 mostly 16.00-17.85 occasional   
higher                                                                         
---LETTUCE-ICEBERG: DEMAND FAIRLY GOOD. MARKET SLIGHTLY HIGHER. Extra services 
included. Wide range in quality and condition. cartons flm lined 24s           
15.55-18.75 mostly 15.55-17.50 few 19.00-19.65 few 12.00-12.95 24s flmwrpd     
16.55-19.75 mostly 16.55-18.50 few 20.00-20.65 few 13.00-13.95 30s flmwrpd     
14.00-15.55 mostly 14.00-14.75 occasional higher ORGANIC cartons 24s flmwrpd   
16.00-20.50 mostly 16.00-18.50 12s flmwrpd 10.00-12.95 mostly 10.00-11.85      
---LETTUCE-OTHER: DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. Extra services     
included.  Wide range in quality and condition. cartons Boston 24s 10.50-14.75 
mostly 10.50-12.55 few 15.50-15.75 Green Leaf 24s 8.56-11.95 mostly 9.50-10.65 
few 12.06-12.50 one label 15.75 Red Leaf 24s 8.56-11.75 mostly 9.50-10.65 few  
12.05-12.75 ORGANIC cartons Green Leaf 24s 12.00-16.75 mostly 12.00-14.75 Red  
Leaf 24s 12.00-16.75 mostly 12.00-14.75                                        
---LETTUCE-ROMAINE: DEMAND HEARTS FAIRLY LIGHT, 24S MODERATE. MARKET ABOUT     
STEADY. Extra services included. Wide range in quality and condition. cartons  
24s 10.00-13.75 mostly 10.15-11.95 occasional higher cartons 12 3-count        
packages Hearts 12.85-17.95 mostly 14.50-16.75 few 18.05-18.75 occasional      
higher cartons film lined Hearts 48s 13.85-19.95 mostly 16.50-18.75 few        
20.00-20.75 ORGANIC cartons 24s 16.50-18.55  few 20.00-20.75 cartons 12 3-count
packages Hearts 13.85-20.50 mostly 17.55-18.95 few 22.50-22.75 

from the text file located here: https://www.ams.usda.gov/mnreports/ix_fv120.txt

CodePudding user response:

If you add ? after a quantifier such as .*, the quantifier becomes lazy (instead of greedy): it matches as little text as possible rather than as much as possible. The regex below (from Regex101) matches just the paragraph. You can modify it to match any other city's vegetable market section.

SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?\n\n (flags: gm)
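To see the greedy/lazy difference in Python, here is a minimal sketch. The sample text is a made-up stand-in for the report (the OXNARD paragraph is hypothetical), and the patterns are simplified versions of the one above, not the exact Regex101 expression.

```python
import re

# Tiny stand-in for the report: two "paragraphs", each ended by a blank line.
# (Hypothetical sample text, not the real USDA file.)
text = (
    "SALINAS-WATSONVILLE CALIFORNIA\n"
    "---BROCCOLI: MARKET SLIGHTLY LOWER.\n"
    "---CELERY: MARKET ABOUT STEADY.\n"
    "\n"
    "OXNARD DISTRICT CALIFORNIA\n"
    "---CELERY: MARKET ABOUT STEADY.\n"
    "\n"
)

# Greedy: [\s\S]* takes as much as it can, so the match runs to the LAST blank line.
greedy = re.search(r"SALINAS[\s\S]*\n\n", text).group(0)

# Lazy: [\s\S]*? takes as little as it can, so the match stops at the FIRST blank line.
lazy = re.search(r"SALINAS[\s\S]*?\n\n", text).group(0)

print("OXNARD" in greedy)  # True: the greedy match spilled into the next paragraph
print("OXNARD" in lazy)    # False: the lazy match stopped after the SALINAS paragraph
```

The only difference between the two patterns is the ? after the *.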

Just change the name (SALINAS-WATSONVILLE CALIFORNIA) to select a different city's market. Also notice that the match includes the two newlines at the end (and a huge amount of trailing whitespace). To drop the newlines, wrap everything before the final \n\n in a group and use just that group (group 1). See Regex101 link.

(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n (flags: gm)
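In Python, the grouped version could be used roughly like this. It is only a sketch: the sample string is a shortened, made-up excerpt, and the per-line rstrip() is my own addition to deal with the trailing whitespace mentioned above. The g flag has no Python equivalent (re.search returns the first match), and m is not needed because the pattern contains no ^ or $.

```python
import re

# Shortened, hypothetical excerpt that mimics the layout of the report,
# including trailing spaces and a whitespace-only line under the header.
sample = (
    "SALINAS-WATSONVILLE CALIFORNIA   \n"
    "   \n"
    "Sales F.O.B. Shipping Point\n"
    "\n"
    "VEGETABLES\n"
    "---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER.\n"
    "---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY.   \n"
    "\n"
    "OXNARD DISTRICT CALIFORNIA\n"
)

# Same pattern as above; group 1 is everything before the closing blank line.
pattern = re.compile(r"(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n")

match = pattern.search(sample)
paragraph = match.group(1) if match else ""

# Remove the trailing whitespace from every line of the captured paragraph.
cleaned = "\n".join(line.rstrip() for line in paragraph.splitlines())
print(cleaned)
```

The MARKET part matters here: without it, the lazy match would already stop at the blank line right after the "Sales F.O.B." line.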

CodePudding user response:

Regular expressions are great for matching patterns, but since you are probably using Python (I inferred that from the title) and dealing with multiple lines at once, I would go for an easier approach.

You may try the following code:

# This is the location for which you want to extract the paragraph
location_to_find = 'SALINAS'

# Read all lines into a list. Each line ends with a '\n'
with open('ix_fv120.txt') as fr:
    lines = fr.readlines()

# Look for all occurrences of "Sales F.O.B." and get their location (i.e. line number). Because
# paragraphs start 2 lines before each occurrence of "Sales F.O.B.", we collect values of "n - 2".
paragraph_poss = [n - 2 for n, line in enumerate(lines) if line.startswith('Sales F.O.B.')]

# Now search only in lines having a location name to see which one of them is for the location
# you are looking for, e.g. "SALINAS".
for n, cur_paragraph_line in enumerate(paragraph_poss):
    if location_to_find in lines[cur_paragraph_line]:
        if n == len(paragraph_poss) - 1:
            # Last paragraph: take everything up to the end of the file
            paragraph = ''.join(lines[cur_paragraph_line:])
        else:
            # Otherwise stop just before the next paragraph starts
            paragraph = ''.join(lines[cur_paragraph_line:paragraph_poss[n + 1]])
        print(paragraph)
        break
else:
    # The for loop finished without a break, so no paragraph matched
    print('Error: Could not find "' + location_to_find + '" at the beginning of any paragraph.')
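If you want to try the idea without downloading the file, the same logic can be wrapped in a function and fed a small list of lines. This is just a sketch: extract_paragraph and sample_lines are names I made up for illustration, and the sample is a made-up miniature of the report.

```python
def extract_paragraph(lines, location):
    """Return the paragraph whose header line contains `location`, or None."""
    # Header lines sit two lines above each "Sales F.O.B." line; ignore any
    # occurrence too close to the top of the list to have a header.
    starts = [n - 2 for n, line in enumerate(lines)
              if line.startswith("Sales F.O.B.") and n >= 2]
    for i, start in enumerate(starts):
        if location in lines[start]:
            # The paragraph ends where the next one starts (or at the end of the input).
            end = starts[i + 1] if i + 1 < len(starts) else len(lines)
            return "".join(lines[start:end])
    return None


# Hypothetical miniature of the report, in the shape readlines() would return.
sample_lines = [
    "SALINAS-WATSONVILLE CALIFORNIA\n",
    " \n",
    "Sales F.O.B. Shipping Point\n",
    "---BROCCOLI: MARKET SLIGHTLY LOWER.\n",
    "\n",
    "OXNARD DISTRICT CALIFORNIA\n",
    " \n",
    "Sales F.O.B. Shipping Point\n",
    "---CELERY: MARKET ABOUT STEADY.\n",
]

print(extract_paragraph(sample_lines, "SALINAS"))
```

Returning None instead of printing an error leaves it to the caller to decide what to do when the location is not found.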