Using a regex to find specific text between two multi-character dividers (Not HTML)

Time:08-04

The problem: I have a text file which represents a list of trips. The characters which indicate a separation between two trips are: a space, 132 '-' chars, and then a new line. I'm trying to find those trips which contain the text 'CUN'. Note that the '-' also can appear in trips, but not in a group of 132; at most, they will appear two in a row.

To give an idea of what the data itself looks like (with slight redactions), here is a sample:

 ------------------------------------------------------------------------------------------------------------------------------------
 ALPA MEAL CODE UPDATES:   M* => PILOT PAID MEAL   Mm => MEAL MOVED TO THIS SEGMENT   Mc => THIS MEAL WAS CHANGED                    
1DSL EFF 04/01/19 THRU 04/30/19    737    LOS ANGELES               APR 2019               04/01/19               PAGE  58           
    EQP    FLT# DPT ARV DPTR ARVL  GRND  ML      FTM   ACM    DTM IND   D/C                                  SU|MO TU WE TH FR|SA    
 EFF 04/06/19 THRU 04/06/19                                                  ID L5265  - BASIC   (HNL)         |-- -- -- -- --| 6    
              RPT: 0803                                                                                      --|-- -- -- -- --|--    
    37K     489 LAX DEN 0903 1228   1.02         2.25  2.25                                                  --|-- -- -- -- --|--    
    37K    1472 DEN SAN 1330 1447  16.38 L       2.17  4.42   6.59      .00                                  --|-- -- -- -- --|--    
              RLS: 1502      HTL: XXXXXX SAN DIEGO             XXX-XXX-XXXX       OP=> XXX-XXX-XXXX          --|-- --                
                                    XXXXXX                          XXX-XXX-XXXX                                                     
              RPT: 0640                                                                                                              
    37X     662 SAN SFO 0725 0907   1.39 B       1.42  1.42                                                                          
    37K    1122 SFO HNL 1046 1328  20.32 L S     5.42  7.24  10.03      .00                                                          
              RLS: 1343      HTL: XXXXXX WAIKIKI               XXX-XXX-XXXX       OP=> XXX-XXX-XXXX                                  
                                     XXXXXX                         XXX-XXX-XXXX                                                     
              RPT: 0915                                                                                                              
    37K    1221 HNL LAX 1000 1834   1.16 L S     5.34  5.34                                                                          
    37K    2048 LAX LAS 1950 2112  17.23         1.22  6.56   9.12      .00                                                          
              RLS: 2127      HTL: XXXXXX VEGAS                 XXX-XXX-XXXX       OP=> XXX-XXX-XXXX                                  
                                     XXXXXX                         XXX-XXX-XXXX                                                     
              RPT: 1350                                                                                                              
    37K     640 LAS ORD 1435 2012   1.03         3.37  3.37                                                                          
    37K    1186 ORD LAX 2115 2344        D       4.29  8.06  10.09      .00                                                          
              RLS: 2359                                                                                                              
                 DAYS- 4 CRD-27.08* FTM-27.08* TAFB- 87.56 INT-   .00 NTE-   .00 M$-229.74 T/C-  .00   .34*                          
 ------------------------------------------------------------------------------------------------------------------------------------
 EFF 04/12/19 THRU 04/12/19                                                  ID L5266  - BASIC                 |-- -- -- -- --|--    
              RPT: 0803                                                                                      --|-- -- -- -- 12|--    
    73Q     489 LAX DEN 0903 1228   1.17         2.25  2.25                                                  --|-- -- -- -- --|--    
    37K     716 DEN AUS 1345 1653   1.02 L       2.08  4.33                                                  --|-- -- -- -- --|--    
    37K     630 AUS IAH 1755 1855   1.01         1.00  5.33                                                  --|-- --                
    37K     530 IAH SAT 1956 2059  17.10         1.03  6.36  11.11      .00                                                          
              RLS: 2114      HTL: XXXXXX SAN ANTONIO           XXX-XXX-XXXX       OP=> XXX-XXX-XXXX                                  
                                     XXXXXX                         XXX-XXX-XXXX                                                     
              RPT: 1324                                                                                                              
    73G    1967 SAT ORD 1409 1655   1.10         2.46  2.46                                                                          
    37K     246 ORD TPA 1805 2155  21.05 D       2.50  5.36   7.46      .00                                                          
              RLS: 2210      HTL: XXXXXX TAMPA                 XXX-XXX-XXXX       OP=> XXX-XXX-XXXX                                  
                                     XXXXXX                                                                                  
              RPT: 1815                                                                                                              
    37K     352 TPA IAD 1900 2109   1.06         2.09  2.09                                                                          
    37K    1448 IAD LAX 2215 0049        D S     5.34  7.43   9.49      .00                                                          
              RLS: 0104                                                                                                              
                 DAYS- 4 CRD-20.00* FTM-19.55* TAFB- 65.01 INT-   .00 NTE-  1.49 M$-159.29 T/C-  .05   .25*                          
 ------------------------------------------------------------------------------------------------------------------------------------
 EFF 04/15/19 THRU 04/15/19                                                  ID L5267  - BASIC   (CAM)         |-- -- -- -- --|--    
              RPT: 0803                                                                                      --|-- -- -- -- --|--    
    73Q     489 LAX DEN 0903 1228   3.40         2.25  2.25                                                  --|15 -- -- -- --|--    
    37K     543 DEN MCO 1608 2137  15.53         3.29  5.54  10.49      .00                                  --|-- -- -- -- --|--    
              RLS: 2152      HTL: XXXXXX ORLANDO               XXX-XXX-XXXX       OP=> XXX-XXX-XXXX          --|-- --                
                                     XXXXXX                         XXX-XXX-XXXX                                                     
              RPT: 1245                                                                                                              
    73Q    1601 MCO EWR 1330 1613   3.06         2.43  2.43                                                                          
    37K    1054 EWR CUN 1919 2220  18.40 D       4.01  6.44  11.05      .00                                                          
              RLS: 2250      HTL: XXXXXX CANCUN                XXX-XXX-XXXX       OP=> XXX-XXX-XXXX                                  
                                     XXXXXX                         XXX-XXX-XXXX                                                     
              RPT: 1615                                                                                                              
    37K    1118 CUN SFO 1700 2051   1.54 D S     5.51  5.51                                                                          
    37K     257 SFO LAX 2245 0020                1.35  7.26  10.20      .00                                                          
              RLS: 0035                                                                                                              
                 DAYS- 4 CRD-20.04* FTM-20.04* TAFB- 64.32 INT-  9.52 NTE-   .00 M$-170.94 T/C-  .00   .25*                          
 ------------------------------------------------------------------------------------------------------------------------------------

With Python or similar, this is trivial to do: split the string on the divider ' ' + '-' * 132 + '\n', and search within each section for 'CUN'.
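For reference, the split-and-search approach could look roughly like the sketch below. The function name trips_containing and the shortened sample trips are made up for illustration, and it assumes the divider line has no trailing whitespace.

```python
# Divider as described in the question: a space, 132 hyphens, a newline.
DIVIDER = " " + "-" * 132 + "\n"


def trips_containing(text: str, needle: str = "CUN") -> list[str]:
    """Return every divider-delimited section of `text` that contains `needle`."""
    sections = text.split(DIVIDER)
    return [section for section in sections if needle in section]


# Shortened, hypothetical sample: two trips, only the second mentions CUN.
sample = (
    DIVIDER
    + "    37K     489 LAX DEN 0903 1228\n"
    + DIVIDER
    + "    37K    1054 EWR CUN 1919 2220\n"
    + "    37K    1118 CUN SFO 1700 2051\n"
    + DIVIDER
)

matches = trips_containing(sample)
print(len(matches))
```

Each returned section excludes the divider lines themselves, since str.split drops the separator.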

So this is not a question about how to do it programmatically, but specifically whether it can be done with a regex alone. I was looking over the file in a text editor (Sublime Text, which uses the Boost regex library with Perl syntax), and the 'obvious' way to find the trips that had CUN in them didn't work:

(?s)^ -{132}\n.*?CUN.*? -{132}\n

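Even though the parts of the trip before and after CUN are captured by the non-greedy .*?, the problem is that the expression matches on the first divider, then runs through all the following trips until it finds one with CUN in it, and only then finishes on the next divider. A lazy quantifier makes the match as short as possible from a fixed starting point, but nothing stops it from crossing a divider along the way.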
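The over-matching is easy to reproduce outside the editor. The sketch below uses Python's re module rather than Sublime's Boost engine, with the lazy pattern written out as .*? and two made-up, shortened trips; re.M is added so that ^ matches at line starts.

```python
import re

DIVIDER = " " + "-" * 132 + "\n"
trip_a = "    37K     489 LAX DEN 0903 1228\n"   # no CUN here
trip_b = "    37K    1054 EWR CUN 1919 2220\n"   # the trip we want
sample = DIVIDER + trip_a + DIVIDER + trip_b + DIVIDER

# Lazy version of the 'obvious' pattern; (?s) lets . match newlines.
naive = re.compile(r"(?s)^ -{132}\n.*?CUN.*? -{132}\n", re.M)
m = naive.search(sample)

# The match starts at the FIRST divider, so the unrelated trip is swallowed.
print(trip_a in m.group(0))
```

The match begins at the very first divider because the regex engine tries start positions left to right, and from that position the lazy .*? is free to extend past the intermediate divider until it reaches CUN.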

If the divider between the sections were a single, unique character, this would be trivially easy to do. If the divider were, for instance, '#', and it didn't appear in the body of the trip, we could use:

(?s)#[^#]*?CUN[^#]*?#
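A quick check of the single-character case, again sketched in Python's re with a made-up one-line input where '#' stands in for the divider:

```python
import re

# Hypothetical input: three "trips" separated by '#'.
text = "#LAX DEN#EWR CUN SFO#HNL LAX#"

# [^#] can never step over a divider, so the match stays inside one section.
pattern = re.compile(r"(?s)#[^#]*?CUN[^#]*?#")
m = pattern.search(text)
print(m.group(0))  # -> #EWR CUN SFO#
```

The negated character class is what keeps the match inside a single section; the multi-character divider has no equally simple equivalent, which is the crux of the question.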

I don't think this is the same case as attempting to parse HTML, XML, or other non-regular languages; it feels like it should be much simpler. But what I feel, of course, may not be so. So my question is: can this be done easily with just regexes? How, specifically? Or is it a typical kind of problem that regexes are bad at?

To state the problem more abstractly:

Consider a text file made up of sections, each separated from the next by a unique string of characters called the divider. While some of the characters within the divider may appear in the sections, the entire divider never appears within a section; in fact, the sections are defined by the dividers. What regular expression will capture an entire section (and only that section) when some part of the section contains a simple string of interest? Or can this not be done?

CodePudding user response:

If you use Sublime, you can find a whole block containing the word CUN with:

^ -{132}(?:\R(?! -{132}$|.*\bCUN\b).*)*+\R.*\bCUN\b.*(?:\R(?! -{132}$).*)*

Explanation

  • ^ Start of a line (Sublime Text's search treats ^ and $ as line anchors)
  •  -{132} Match a space followed by 132 hyphens
  • (?: Non-capture group
    • \R(?! -{132}$|.*\bCUN\b).* Match a newline and the rest of the line, provided the line is neither a divider nor contains the word CUN
  • )*+ Close the non-capture group and repeat it zero or more times, using a possessive quantifier to prevent some backtracking if there is no match
  • \R Match a newline
  • .*\bCUN\b.* Match a line containing the word CUN
  • (?:\R(?! -{132}$).*)* Optionally match the remaining lines, stopping before the next divider line
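
For anyone wanting to try the same idea outside Sublime, here is a rough adaptation to Python's re module. It is a sketch, not the answer's exact pattern: \R is replaced with \n because re does not support \R, the possessive quantifier is dropped for compatibility with Python versions before 3.11 (it only affects backtracking, not what matches here), and re.M supplies the line-anchor behaviour. The sample trips are shortened and made up.

```python
import re

DIVIDER = " " + "-" * 132 + "\n"

# Adapted pattern: divider line, then lines without CUN, then a CUN line,
# then every following line up to (not including) the next divider.
block_with_cun = re.compile(
    r"^ -{132}"
    r"(?:\n(?! -{132}$|.*\bCUN\b).*)*"
    r"\n.*\bCUN\b.*"
    r"(?:\n(?! -{132}$).*)*",
    re.M,
)

trip_a = "    37K     489 LAX DEN 0903 1228\n"
trip_b = (
    "    37K    1054 EWR CUN 1919 2220\n"
    "    37K    1118 CUN SFO 1700 2051\n"
)
trip_c = "    37K    1221 HNL LAX 1000 1834\n"
sample = DIVIDER + trip_a + DIVIDER + trip_b + DIVIDER + trip_c + DIVIDER

blocks = [m.group(0) for m in block_with_cun.finditer(sample)]
print(len(blocks))
```

Each match includes the opening divider line but stops before the closing one, so consecutive trips can both be found even though they share a divider.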
