The problem: I have a text file which represents a list of trips. The characters which indicate a separation between two trips are: a space, 132 '-' chars, and then a new line. I'm trying to find those trips which contain the text 'CUN'. Note that the '-' also can appear in trips, but not in a group of 132; at most, they will appear two in a row.
To give an idea of what the data itself looks like (with slight redactions), here is a sample:
------------------------------------------------------------------------------------------------------------------------------------
ALPA MEAL CODE UPDATES: M* => PILOT PAID MEAL Mm => MEAL MOVED TO THIS SEGMENT Mc => THIS MEAL WAS CHANGED
1DSL EFF 04/01/19 THRU 04/30/19 737 LOS ANGELES APR 2019 04/01/19 PAGE 58
EQP FLT# DPT ARV DPTR ARVL GRND ML FTM ACM DTM IND D/C SU|MO TU WE TH FR|SA
EFF 04/06/19 THRU 04/06/19 ID L5265 - BASIC (HNL) |-- -- -- -- --| 6
RPT: 0803 --|-- -- -- -- --|--
37K 489 LAX DEN 0903 1228 1.02 2.25 2.25 --|-- -- -- -- --|--
37K 1472 DEN SAN 1330 1447 16.38 L 2.17 4.42 6.59 .00 --|-- -- -- -- --|--
RLS: 1502 HTL: XXXXXX SAN DIEGO XXX-XXX-XXXX OP=> XXX-XXX-XXXX --|-- --
XXXXXX XXX-XXX-XXXX
RPT: 0640
37X 662 SAN SFO 0725 0907 1.39 B 1.42 1.42
37K 1122 SFO HNL 1046 1328 20.32 L S 5.42 7.24 10.03 .00
RLS: 1343 HTL: XXXXXX WAIKIKI XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 0915
37K 1221 HNL LAX 1000 1834 1.16 L S 5.34 5.34
37K 2048 LAX LAS 1950 2112 17.23 1.22 6.56 9.12 .00
RLS: 2127 HTL: XXXXXX VEGAS XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 1350
37K 640 LAS ORD 1435 2012 1.03 3.37 3.37
37K 1186 ORD LAX 2115 2344 D 4.29 8.06 10.09 .00
RLS: 2359
DAYS- 4 CRD-27.08* FTM-27.08* TAFB- 87.56 INT- .00 NTE- .00 M$-229.74 T/C- .00 .34*
------------------------------------------------------------------------------------------------------------------------------------
EFF 04/12/19 THRU 04/12/19 ID L5266 - BASIC |-- -- -- -- --|--
RPT: 0803 --|-- -- -- -- 12|--
73Q 489 LAX DEN 0903 1228 1.17 2.25 2.25 --|-- -- -- -- --|--
37K 716 DEN AUS 1345 1653 1.02 L 2.08 4.33 --|-- -- -- -- --|--
37K 630 AUS IAH 1755 1855 1.01 1.00 5.33 --|-- --
37K 530 IAH SAT 1956 2059 17.10 1.03 6.36 11.11 .00
RLS: 2114 HTL: XXXXXX SAN ANTONIO XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 1324
73G 1967 SAT ORD 1409 1655 1.10 2.46 2.46
37K 246 ORD TPA 1805 2155 21.05 D 2.50 5.36 7.46 .00
RLS: 2210 HTL: XXXXXX TAMPA XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX
RPT: 1815
37K 352 TPA IAD 1900 2109 1.06 2.09 2.09
37K 1448 IAD LAX 2215 0049 D S 5.34 7.43 9.49 .00
RLS: 0104
DAYS- 4 CRD-20.00* FTM-19.55* TAFB- 65.01 INT- .00 NTE- 1.49 M$-159.29 T/C- .05 .25*
------------------------------------------------------------------------------------------------------------------------------------
EFF 04/15/19 THRU 04/15/19 ID L5267 - BASIC (CAM) |-- -- -- -- --|--
RPT: 0803 --|-- -- -- -- --|--
73Q 489 LAX DEN 0903 1228 3.40 2.25 2.25 --|15 -- -- -- --|--
37K 543 DEN MCO 1608 2137 15.53 3.29 5.54 10.49 .00 --|-- -- -- -- --|--
RLS: 2152 HTL: XXXXXX ORLANDO XXX-XXX-XXXX OP=> XXX-XXX-XXXX --|-- --
XXXXXX XXX-XXX-XXXX
RPT: 1245
73Q 1601 MCO EWR 1330 1613 3.06 2.43 2.43
37K 1054 EWR CUN 1919 2220 18.40 D 4.01 6.44 11.05 .00
RLS: 2250 HTL: XXXXXX CANCUN XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 1615
37K 1118 CUN SFO 1700 2051 1.54 D S 5.51 5.51
37K 257 SFO LAX 2245 0020 1.35 7.26 10.20 .00
RLS: 0035
DAYS- 4 CRD-20.04* FTM-20.04* TAFB- 64.32 INT- 9.52 NTE- .00 M$-170.94 T/C- .00 .25*
------------------------------------------------------------------------------------------------------------------------------------
With Python or similar, this is trivial to do; simply split the string on the divider ' ' + '-' * 132 + '\n', and search within each section for 'CUN'.
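For reference, the split-and-search approach mentioned above might look like the following minimal sketch. The function name `trips_containing` and the tiny `sample` string are made up for illustration; only the divider (a space, 132 hyphens, a newline) comes from the description of the file.

```python
# Divider as described: one space, 132 hyphens, then a newline.
DIVIDER = " " + "-" * 132 + "\n"


def trips_containing(text, needle="CUN", divider=DIVIDER):
    """Split `text` on `divider` and keep the sections that contain `needle`."""
    sections = text.split(divider)
    return [section for section in sections if needle in section]


# Hypothetical, heavily shortened stand-in for the real file.
sample = (
    DIVIDER
    + "37K 489 LAX DEN 0903 1228\n"
    + DIVIDER
    + "37K 1054 EWR CUN 1919 2220\n37K 1118 CUN SFO 1700 2051\n"
    + DIVIDER
)

matches = trips_containing(sample)
print(len(matches))
```

Note that this is a plain substring test, so it would also accept 'CUN' inside a longer word such as 'CANCUN'.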
So, this is not a question on how to do this programmatically, but more specifically whether it can be done only with regexes. I was looking over the file with a text editor (Sublime Text, which uses the perl Boost library), and the 'obvious' way to find the trips that had CUN in them didn't work:
(?s)^ -{132}\n.+?CUN.+? -{132}\n
Even though the parts of the trip before and after CUN are captured by the non-greedy .+?, the problem is that the expression matches on the first divider, then all the trips afterwards until it finds one with CUN in it, then finishes on the next divider.
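The overshoot can be reproduced outside the editor. The sketch below uses Python's `re` module rather than Sublime's Boost engine, so it is only an approximation of the editor's behaviour; the three-trip `text` is invented, and the `DOTALL` and `MULTILINE` flags stand in for `(?s)` and for `^` matching at line starts.

```python
import re

DIVIDER = " " + "-" * 132 + "\n"

# Three made-up trips; only the last one mentions CUN.
text = (
    DIVIDER + "LAX DEN\n"
    + DIVIDER + "DEN AUS\n"
    + DIVIDER + "EWR CUN\n"
    + DIVIDER
)

# Lazy quantifiers limit how far a match extends, not where it starts.
lazy = re.compile(r"^ -{132}\n.+?CUN.+? -{132}\n", re.DOTALL | re.MULTILINE)
m = lazy.search(text)

# The match begins at the very first divider, so it swallows the two
# trips that precede the one containing CUN.
print(m.start())
print("LAX DEN" in m.group(0))
```

The engine commits to the leftmost position where a match is possible at all, and from the first divider `.+?` is free to walk across later dividers until it reaches CUN.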
If the divider between the sections was a single, unique character, this would be trivially easy to do. If the divider was, for instance, '#', and it didn't appear in the body of the trip, we could use:
(?s)#[^#]+?CUN[^#]+?#
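A quick check of the single-character case, again as a sketch in Python's `re` with an invented `text`. The negated class `[^#]` cannot cross a divider, which is what keeps the match inside one section.

```python
import re

# Made-up sections separated by a single '#' that never occurs in a section.
text = "#LAX DEN\n#DEN AUS\n#EWR CUN\n#"

# [^#] matches newlines too, so no DOTALL flag is needed here.
pattern = re.compile(r"#[^#]+?CUN[^#]+?#")
m = pattern.search(text)

print(repr(m.group(0)))
```

Attempts starting at the first two '#' characters fail because `[^#]+?` runs into the next '#' before it finds CUN, so the only match is the third section.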
I don't think this is the same case as attempting to parse HTML, XML, or other non-regular languages; it feels like it should be much simpler, though that feeling may be wrong. So my question is: can this be done easily with just regexes? If so, how, specifically? Or is this a typical kind of problem that regexes are bad at?
To state the problem more abstractly:
Consider a text file divided into sections; each section is separated from the next by a unique string of characters called the divider. While some of the characters within the divider may appear in the sections, the entire divider will never appear within a section; in fact, the sections are defined by the dividers. What regular expression will capture an entire section (and only that section) when some part of the section contains some simple string of interest? Or can this not be done?
CodePudding user response:
If you use Sublime, you can find a whole block containing the word CUN with:
^ -{132}(?:\R(?! -{132}$|.*\bCUN\b).*)*+\R.*\bCUN\b.*(?:\R(?! -{132}$).*)*
Explanation
^
  Start of a line
 -{132}
  Match a space and 132 hyphens
(?:
  Non capture group
\R(?! -{132}$|.*\bCUN\b).*
  Match a newline, and the rest of the line if that line is not the divider and does not contain the word CUN
)*+
  Close the non capture group and optionally repeat it, using a possessive quantifier to prevent some backtracking if there is no match
\R
  Match a newline
.*\bCUN\b.*
  Match a line with the word CUN
(?:\R(?! -{132}$).*)*
  Optionally match the rest of the lines that are not the divider line
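The same idea can be tried in Python, with two adaptations that are assumptions of this sketch rather than part of the pattern above: Python's `re` has no `\R`, so `\n` is used instead, and the possessive `*+` (only available from Python 3.11) is replaced by a plain `*`, which affects backtracking cost but not which text matches. The helper name `section_regex` and the sample `text` are hypothetical.

```python
import re


def section_regex(needle="CUN", dashes=132):
    """Build a pattern matching one divider plus its section, if the
    section has a line containing `needle` as a whole word."""
    divider = rf" -{{{dashes}}}"   # e.g. ' -{132}'
    word = re.escape(needle)
    return re.compile(
        rf"^{divider}"                              # opening divider line
        rf"(?:\n(?!{divider}$|.*\b{word}\b).*)*"    # lines before the needle
        rf"\n.*\b{word}\b.*"                        # first line with the needle
        rf"(?:\n(?!{divider}$).*)*",                # remaining lines of the section
        re.MULTILINE,
    )


DIV = " " + "-" * 132
text = (
    DIV + "\nLAX DEN\n"
    + DIV + "\nDEN AUS\n"
    + DIV + "\nEWR CUN\nCUN SFO\n"
    + DIV + "\n"
)

found = [m.group(0) for m in section_regex().finditer(text)]
print(len(found))
```

As in the editor version, each match includes the divider that opens the section but stops before the divider that closes it.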