Regex, All before an underscore, and all between second underscore and the last period?

Time:02-26

How do I get everything before the first underscore, and everything between the last underscore and the period in the file extension?

So far, I can match everything before the first underscore, but I'm not sure what to do after that.

.+?(?=_)

EXAMPLES:

  • 111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf
  • 222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf

DESIRED RESULTS:

  • 111111_END TLD 6-01-20 THR LEWISHS
  • 222222_MIKE_G URS TO 7.25 2-28-19 SA COOPSHS

CodePudding user response:

You could use 2 capture groups:

^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)
  • ^ Start of string
  • ([^_\n]+_) Capture group 1, match one or more chars other than _ or a newline, followed by a _
  • .*\b Match the rest of the line and match a word boundary
  • ([^\s_]*_.*) Capture group 2, match zero or more chars other than _ or a whitespace char, then match _ and the rest of the line
  • (?=\.) Positive lookahead, assert a . to the right

See a regex demo.
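
As a rough illustration, the pattern can be tried in Python with the standard `re` module. The helper name `join_groups` is made up for this sketch; it simply concatenates the two capture groups. Note that group 2 starts at the last word before the final underscore, so the first name is kept, which lines up with the second desired result rather than the first.

```python
import re

# Pattern from the answer: group 1 is the prefix through the first underscore,
# group 2 runs from the last word before the final underscore up to the last period.
PATTERN = re.compile(r"^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)")

def join_groups(filename):
    """Hypothetical helper: return group 1 + group 2, or None if there is no match."""
    m = PATTERN.search(filename)
    return m.group(1) + m.group(2) if m else None

print(join_groups("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
print(join_groups("222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf"))
```

With the two sample names this should print `111111_JIM_END TLD 6-01-20 THR LEWISHS` and `222222_MIKE_G URS TO 7.25 2-28-19 SA COOPSHS`.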

Another option is to use a non-greedy quantifier to reach the next _, make sure that no underscores follow it, and then match the last dot:

^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$

See another regex demo.
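
A minimal Python sketch of this second pattern, assuming the standard `re` module; `extract_lazy` is a hypothetical helper name. Unlike the lookahead version, this pattern consumes the extension, so only the two groups are used.

```python
import re

# Second pattern from the answer: lazy .*? skips ahead to the word before the
# last underscore; [^_\n]+ guarantees no further underscores; \.[^.\n]+$ is the extension.
PATTERN_2 = re.compile(r"^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$")

def extract_lazy(filename):
    """Hypothetical helper: return group 1 + group 2, or None if there is no match."""
    m = PATTERN_2.match(filename)
    return m.group(1) + m.group(2) if m else None

for name in (
    "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf",
    "222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf",
):
    print(extract_lazy(name))
```

For the sample names, the results should be the same as with the first pattern.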

CodePudding user response:

Looks like you're very close. You could eliminate the names between the underscores by matching (_.+?_) and replacing the match with a single underscore.

I am assuming that you did not intend your second result to include the name MIKE.
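
A small Python sketch of that replacement, assuming `re.sub` from the standard library; `drop_middle` is a hypothetical name. Note that the file extension is still present afterwards, so it would have to be stripped in a separate step.

```python
import re

def drop_middle(filename):
    """Hypothetical helper: collapse the '_name_' span into a single underscore."""
    # _.+?_ lazily matches from the first underscore to the next one.
    return re.sub(r"_.+?_", "_", filename)

print(drop_middle("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
print(drop_middle("222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf"))
```

This should print `111111_END TLD 6-01-20 THR LEWISHS.pdf` and `222222_G URS TO 7.25 2-28-19 SA COOPSHS.pdf`.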

CodePudding user response:

You can match the following regular expression, which contains no capture groups.

^[^_]*|(?!.*_).*(?=\.)

Demo

This expression can be broken down as follows.

^        # match the beginning of the string
[^_]*    # match zero or more characters other than an underscore
|        # or
(?!      # begin negative lookahead
  .*_    # match zero or more characters followed by an underscore
)        # end negative lookahead
.*       # match zero or more characters greedily
(?=      # begin positive lookahead
  \.     # match a period
)        # end positive lookahead
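
One way to try this expression is with `re.findall` in Python, sketched below; `wanted_parts` is a hypothetical helper name. Because the second alternative can also succeed with zero characters right at the final period, the sketch filters out empty matches as a precaution.

```python
import re

# Expression from the answer: either the text before the first underscore,
# or the text after the last underscore up to the last period.
PATTERN_ALT = re.compile(r"^[^_]*|(?!.*_).*(?=\.)")

def wanted_parts(filename):
    """Hypothetical helper: return the non-empty matches in order."""
    return [m for m in PATTERN_ALT.findall(filename) if m]

parts = wanted_parts("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf")
print(parts)
print("_".join(parts))  # rejoin the two parts with a single underscore
```

Joining the two parts with an underscore should give `111111_END TLD 6-01-20 THR LEWISHS` for the first sample name.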

.*_ means to match zero or more characters greedily, followed by an underscore. To match greedily (the default) means to match as many characters as possible. Here that includes all underscores (if there are any) before the last one. Similarly, .* followed by (?=\.) means to match zero or more characters, possibly including periods, up to the last period.

Had I written .*?_ (incorrectly) it would match zero or more characters lazily, followed by an underscore. That means it would match as few characters as possible before matching an underscore; that is, it would match zero or more characters up to, but not including, the first underscore.
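
The difference between greedy and lazy matching can be seen directly in a short Python sketch (standard `re` module; the variable names are only for illustration):

```python
import re

name = "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"

# Greedy: .* takes as much as possible, so the match ends at the LAST underscore.
greedy = re.match(r".*_", name).group()

# Lazy: .*? takes as little as possible, so the match ends at the FIRST underscore.
lazy = re.match(r".*?_", name).group()

# Greedy .* in front of a lookahead for a period stops at the LAST period.
before_last_dot = re.match(r".*(?=\.)", "TO 7.25 SA COOPSHS.pdf").group()

print(repr(greedy))
print(repr(lazy))
print(repr(before_last_dot))
```

Here `greedy` should be `'111111_SMITH, JIM_'`, `lazy` should be `'111111_'`, and `before_last_dot` should be `'TO 7.25 SA COOPSHS'`.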


If instead of capturing the two parts of the string of interest you wanted to remove the two parts of the string you don't want (as suggested by the desired results of your example), you could substitute matches of the following regular expression with empty strings.

_.*_|\.[^.]*$

Demo

This regular expression reads, "Match an underscore followed by zero or more characters followed by an underscore, or match a period followed by zero or more characters that are not periods, followed by the end of the string".
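
A Python sketch of this substitution, using `re.sub` from the standard library. One thing to be aware of: as written, the expression removes both underscores along with the text between them, so the two kept parts end up joined with no separator. The second function below is my own variation, not part of the answer: it assumes a lookbehind is acceptable and uses one to leave the first underscore in place. Both function names are hypothetical.

```python
import re

def strip_unwanted(filename):
    """Hypothetical helper: apply the expression from the answer as written."""
    return re.sub(r"_.*_|\.[^.]*$", "", filename)

def strip_unwanted_keep_underscore(filename):
    """Assumed variation (not from the answer): keep the first underscore."""
    # (?<=_) starts the match just after the first underscore instead of on it.
    return re.sub(r"(?<=_).*_|\.[^.]*$", "", filename)

name = "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"
print(strip_unwanted(name))
print(strip_unwanted_keep_underscore(name))
```

For this name, the first call should print `111111END TLD 6-01-20 THR LEWISHS` and the second `111111_END TLD 6-01-20 THR LEWISHS`.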
