How do I get everything before the first underscore, and everything between the last underscore and the period in the file extension?
So far I can match everything before the first underscore, but I'm not sure what to do after that.
.+?(?=_)
EXAMPLES:
- 111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf
- 222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf
DESIRED RESULTS:
- 111111_END TLD 6-01-20 THR LEWISHS
- 222222_MIKE_G URS TO 7.25 2-28-19 SA COOPSHS
CodePudding user response:
You could use 2 capture groups:
^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)
- ^ Start of string
- ([^_\n]+_) Capture group 1: match one or more chars other than _ or a newline, followed by _
- .*\b Match the rest of the line and match a word boundary
- ([^\s_]*_.*) Capture group 2: optionally match any chars other than _ or whitespace, then match _ and the rest of the line
- (?=\.) Positive lookahead: assert a . to the right
See a regex demo.
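As a rough illustration, here is a minimal Python sketch of the first pattern, written with [^_\n]+_ in group 1. The question does not name a language, so Python's re module is an assumption, and join_groups is a hypothetical helper name. Note that group 2 starts at the last whitespace-delimited token containing an underscore, so the first name is kept in both results.

```python
import re

# First pattern from this answer; group 1 uses the + quantifier.
PATTERN = re.compile(r"^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)")

def join_groups(filename):
    """Return group 1 + group 2, or None when the pattern does not match."""
    m = PATTERN.search(filename)
    return m.group(1) + m.group(2) if m else None

names = [
    "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf",
    "222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf",
]
for name in names:
    print(join_groups(name))
# 111111_JIM_END TLD 6-01-20 THR LEWISHS
# 222222_MIKE_G URS TO 7.25 2-28-19 SA COOPSHS
```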
Another option is a non-greedy version that gets to the first remaining _, makes sure that no underscores follow it, and then matches the last dot:
^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$
See another regex demo.
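The non-greedy variant can be tried the same way. This is only a sketch, again assuming Python's re module; join_groups_lazy is a hypothetical name. Unlike the lookahead version, this pattern consumes the extension and anchors at the end of the line, so a name without a period does not match at all.

```python
import re

# Second pattern from this answer: lazy .*? up to the token holding the last underscore.
PATTERN = re.compile(r"^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$")

def join_groups_lazy(filename):
    """Return group 1 + group 2, or None when the pattern does not match."""
    m = PATTERN.search(filename)
    return m.group(1) + m.group(2) if m else None

print(join_groups_lazy("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
# 111111_JIM_END TLD 6-01-20 THR LEWISHS
print(join_groups_lazy("222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf"))
# 222222_MIKE_G URS TO 7.25 2-28-19 SA COOPSHS
print(join_groups_lazy("111111_A_B"))  # no extension, so no match
# None
```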
CodePudding user response:
Looks like you're very close. You could eliminate the names between the underscores by finding (_.+?_) and replacing the match with a single underscore.
I am assuming that you did not intend your second result to include the name MIKE.
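A minimal sketch of that replacement, assuming Python's re.sub (the language is not stated in the question, and strip_middle is a hypothetical name). The file extension is left in place, since the pattern only targets the text between the underscores.

```python
import re

def strip_middle(filename):
    # Lazily match from the first underscore to the next one and
    # collapse the whole match into a single underscore.
    return re.sub(r"(_.+?_)", "_", filename)

print(strip_middle("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
# 111111_END TLD 6-01-20 THR LEWISHS.pdf
print(strip_middle("222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf"))
# 222222_G URS TO 7.25 2-28-19 SA COOPSHS.pdf
```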
CodePudding user response:
You can match the following regular expression that contains no capture groups.
^[^_]*|(?!.*_).*(?=\.)
This expression can be broken down as follows.
^ # match the beginning of the string
[^_]* # match zero or more characters other than an underscore
| # or
(?! # begin negative lookahead
.*_ # match zero or more characters followed by an underscore
) # end negative lookahead
.* # match zero or more characters greedily
(?= # begin positive lookahead
\. # match a period
) # end positive lookahead
.*_ means to match zero or more characters greedily, followed by an underscore. To match greedily (the default) means to match as many characters as possible. Here that includes all underscores (if there are any) before the last one. Similarly, .* followed by (?=\.) means to match zero or more characters, possibly including periods, up to the last period.
Had I written .*?_ (incorrectly) it would match zero or more characters lazily, followed by an underscore. That means it would match as few characters as possible before matching an underscore; that is, it would match zero or more characters up to, but not including, the first underscore.
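The behaviour described above can be checked with a short sketch, assuming Python's re module (an assumption, as the question names no language; two_parts is a hypothetical helper). Empty matches are filtered out because the second alternative can also match the empty string just before the final period, depending on the engine and version.

```python
import re

# Pattern from this answer: no capture groups, two alternatives.
PATTERN = re.compile(r"^[^_]*|(?!.*_).*(?=\.)")

def two_parts(filename):
    # Keep only non-empty matches (see the note on empty matches above).
    return [m for m in PATTERN.findall(filename) if m]

name = "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"
print(two_parts(name))
# ['111111', 'END TLD 6-01-20 THR LEWISHS']

# Greedy versus lazy, matched from the start of the same string:
greedy = re.match(r".*_", name).group()   # runs to the last underscore
lazy = re.match(r".*?_", name).group()    # stops at the first underscore
print(greedy)
# 111111_SMITH, JIM_
print(lazy)
# 111111_
```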
If instead of capturing the two parts of the string of interest you wanted to remove the two parts of the string you don't want (as suggested by the desired results of your example), you could substitute matches of the following regular expression with empty strings.
_.*_|\.[^.]*$
This regular expression reads, "Match an underscore followed by zero or more characters followed by an underscore, or match a period followed by zero or more characters that are not periods, followed by the end of the string".
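A minimal sketch of that substitution, assuming Python's re.sub (remove_unwanted is a hypothetical name). Both underscores are part of the first alternative's match, so no underscore remains in the result.

```python
import re

# Either the span from the first to the last underscore, or the final extension.
UNWANTED = re.compile(r"_.*_|\.[^.]*$")

def remove_unwanted(filename):
    # Replace every match with the empty string.
    return UNWANTED.sub("", filename)

print(remove_unwanted("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
# 111111END TLD 6-01-20 THR LEWISHS
print(remove_unwanted("222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf"))
# 222222G URS TO 7.25 2-28-19 SA COOPSHS
```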