Regex - each occurrence from text

Time:10-03

I'm trying to parse weather alerts from NOAA/NWS. I must admit my regex is not great, and I'm struggling with one concept.

Basically, the alerts force a maximum number of characters per line, so there are many line breaks that I want to ignore so I can grab "concepts" out of the text.

What I'd like to do is pull out all the * items as individual items that I can format as bullets later.

It appears that the bullets are separated by \n\n. So I've tried

/\* (?:\X*)(?: \n){2}/

This seems to select all the way to the line before "* Associated". What I'd like to get is each * grouping selected individually.

Below is a sample of what I'm working with.

I'm certain for a person with better regex this is simple so I'm hoping someone can give me some guidance. Thanks!

...FLASH FLOOD WATCH REMAINS IN EFFECT UNTIL 7 AM AKDT THIS 
MORNING... 
The Flash Flood Watch continues for: 
 
* Including the following area, Cape Decision to Salisbury Sound 
Coastal Area. 
 
* Until 7 AM AKDT this morning. 
 
* The low tracked inland over the northeast gulf coast early 
Saturday morning. This storm produced 48 hour rain amounts ranging 
from 2 to 4 inches with possible localized higher amounts. Locally 
heavy showers will spread onshore through Saturday morning. 
Antecedent soil moisture values are similar to last October which 
produced landslides in the Sitka area. 
 
* Associated with the moderate to heavy rainfall, high winds will 
increase the risk of isolated landslides in prone areas late 
Friday afternoon into Friday night. There will be sharp rises on 
small streams through Friday with potential flooding by Saturday 
morning.

CodePudding user response:

You can use the regular expression

(?<=^\* )(?:[^\n]|\n(?! *\n))*

with the multiline option invoked, so that ^ and $ match the beginning and end of each line rather than (the default) the beginning and end of the string.
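As a rough illustration, here is how the pattern might be applied in Python, which the thread itself does not specify. The names `BULLET_RE`, `extract_bullets` and `SAMPLE` are made up for this sketch, and the whitespace normalization step is an addition that is not part of the regex above.

```python
import re

# The answer's pattern; re.MULTILINE makes ^ match at the start of every line.
BULLET_RE = re.compile(r"(?<=^\* )(?:[^\n]|\n(?! *\n))*", re.MULTILINE)


def extract_bullets(text):
    """Return each '* ' item as a single line of text (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace, so
    # re-joining with one space removes the hard line breaks inside an item.
    return [" ".join(match.split()) for match in BULLET_RE.findall(text)]


# Shortened excerpt of the alert; the "blank" lines contain a single space.
SAMPLE = (
    "The Flash Flood Watch continues for: \n"
    " \n"
    "* Including the following area, Cape Decision to Salisbury Sound \n"
    "Coastal Area. \n"
    " \n"
    "* Until 7 AM AKDT this morning. \n"
)

for bullet in extract_bullets(SAMPLE):
    print("-", bullet)
```

Because the lookbehind only asserts the leading "* ", the asterisk and space are not part of the match, so each result starts directly with the item's text.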


Consider the string

* Including the following area, Cape Decision to Salisbury Sound 
Coastal Area. 

To see all the spaces and newlines in the string, I've converted the spaces to underscores and the newlines to octothorpes, also known as hashtags or pound signs ('#'):

 *_Including_the_following_area,_Cape_Decision_to_Salisbury_Sound_#Coastal_Area._##

Note the space and newline following "Sound" and the space preceding the two consecutive newlines at the end. We obviously must account for these in the regular expression.

We can write the regular expression in free-spacing mode to make it self-documenting. I will use a generic syntax (see note 1).

/
(?<=      # begin a positive lookbehind
  ^       # match the beginning of a line
  \*      # match '*'
  \       # match a space (escaped space character)
)         # end positive lookbehind
(?:       # begin non-capture group
  [^\n]   # match any character other than a newline
  |       # or
  \n      # match a newline
  (?!     # begin negative lookahead
    \ *   # match zero or more (*) spaces
    \n    # match a newline
  )       # end negative lookahead
)         # end non-capture group
*         # execute non-capture group zero or more times
/xm       # invoke free-spacing and multiline modes
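For readers working in Python, a free-spacing version could look something like the sketch below, using `re.VERBOSE` for the x flag and `re.MULTILINE` for the m flag. Python is an assumption here, as is the name `BULLET_RE_VERBOSE`. The space is written as `[ ]` rather than as an escaped space, since whitespace inside a character class is preserved in verbose mode.

```python
import re

# Free-spacing form of the pattern; behaviour should match the compact version.
BULLET_RE_VERBOSE = re.compile(
    r"""
    (?<=        # positive lookbehind: the match must be preceded by
      ^         #   the beginning of a line,
      \*        #   a literal asterisk,
      [ ]       #   and one space (kept because it is in a character class)
    )
    (?:         # non-capture group
      [^\n]     #   any character other than a newline
      |         #   or
      \n        #   a newline
      (?!       #   that is not followed by
        [ ]*    #     zero or more spaces
        \n      #     and another newline
      )
    )*          # repeat the group zero or more times
    """,
    re.VERBOSE | re.MULTILINE,
)

text = "* one \ntwo \n \n* three \n"
matches = BULLET_RE_VERBOSE.findall(text)
print(matches)
```

Note that the single newline at the very end of the text is included in the last match, because it is not followed by a second newline.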

1. Note that I've escaped the two space characters. That's because when a regular expression is parsed in free-spacing mode, all unprotected whitespace (and comments) is stripped out before it is evaluated. Escaping a space protects it, though it may not be obvious to the reader that a space is present when the expression is written in free-spacing mode. Depending on the language, I prefer to protect the space in a way that is more obvious. In Ruby, for example, there are several options, one being to put the space in a character class ([ ]). Incidentally, if one wishes to match a space, the match should be against a space character, not against a whitespace character (\s). The fact that newlines are whitespace characters can be particularly troublesome when only spaces are to be matched.
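The point about \s can be seen in a small Python sketch (Python is an assumption; the variable names are only illustrative). Matching runs of spaces and runs of whitespace over the same fragment gives different results, because \s also consumes the newlines that separate the items.

```python
import re

# Fragment of the alert around the gap between two items.
text = "Coastal Area. \n \n* Until"

# A character class containing only a space never crosses a line break.
spaces_only = re.findall(r"[ ]+", text)

# \s also matches newlines, so the whole " \n \n" separator becomes one run.
any_whitespace = re.findall(r"\s+", text)

print(spaces_only)
print(any_whitespace)
```

If the lookahead in the pattern used `\s*` instead of ` *`, it could match across the blank line itself, which is one way the "only spaces" intent gets lost.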

CodePudding user response:

I may have figured it out, but I would still like some comments on whether this is the right way to do it.

What I did was look for the doubled " \n" as the terminator, and also handle the case where the * item is the last one, which ends with a period at the end of the text (\.$). I also made the regex ungreedy with the /gU flags.

\* (?:\X*)(?:(?: \n){2}|\.$)

Does this make sense?
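For comparison, a rough Python analogue of this ungreedy approach might look like the sketch below. It is not a direct port: Python's standard `re` module has neither `\X` nor a U flag, so this version assumes `.*?` with `re.DOTALL` as a stand-in, and adds a capture group so the item text can be pulled out. `LAZY_RE` and `bullets_lazy` are hypothetical names.

```python
import re

# Lazy match from "* " up to the first terminator: either two " \n" in a row,
# or a period at the very end of the text. re.DOTALL lets . match newlines.
LAZY_RE = re.compile(r"\* (.*?)(?:(?: \n){2}|\.$)", re.DOTALL)


def bullets_lazy(text):
    """Return the captured text of each '* ' item (hypothetical helper)."""
    return [m.group(1) for m in LAZY_RE.finditer(text)]


text = "* one \ntwo. \n \n* three \nfour."
print(bullets_lazy(text))
```

One thing this sketch makes visible: when the last item is terminated by `\.$`, the final period is consumed by the terminator and so is not part of the captured text, whereas items ended by the doubled " \n" keep their period.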
