So I'm trying to pull all domains out of a text file. They may have a special character at the beginning (like a font tag).
My pattern so far: (?<=>).*?(?=com|net)
Text I'm searching:
thisdomain.com fake text >thatdomain.net
It is currently finding "thisdomain" and "thatdomain", but of course it's cutting off the domain extension. I've dug through regex docs for about an hour and can't find a way to search between the > and .com without cutting off the .com. Any suggestions?
CodePudding user response:
Use
(?<=>).*?(?:com|net)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    com                      'com'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    net                      'net'
--------------------------------------------------------------------------------
  )                        end of grouping
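To see the difference between the two patterns, here is a small sketch in Python's re module (the question does not say which regex engine is in use, so Python is an assumption, and the function names are only illustrative). It runs both patterns on the sample text from the question:

```python
import re

# Sample text taken from the question.
SAMPLE = "thisdomain.com fake text >thatdomain.net"

def with_lookahead(text):
    # Original attempt: (?=com|net) only checks for the extension,
    # it does not make it part of the match.
    return re.findall(r"(?<=>).*?(?=com|net)", text)

def with_group(text):
    # Suggested fix: (?:com|net) consumes the extension,
    # so it ends up in the match.
    return re.findall(r"(?<=>).*?(?:com|net)", text)

print(with_lookahead(SAMPLE))  # ['thatdomain.']
print(with_group(SAMPLE))      # ['thatdomain.net']
```

Note that with this exact sample only the domain preceded by > is found, because the lookbehind requires a > directly before the match.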
CodePudding user response:
The reason you are not matching .com is that this part, (?=com|net), is a lookahead assertion, which is non-consuming. Instead, you can just match either .com or .net, including the dot, which prevents matching, for example, >clarinet or >romcom:
(?<=>)\S+\.(?:com|net)
The pattern matches:
(?<=           Positive lookbehind, assert what is directly to the left of the current position
>              Match literally
)              Close the lookbehind
\S+            Match 1 or more non-whitespace characters
\.(?:com|net)  Match either .com or .net
See a regex demo.
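As a quick check of the point about the dot, here is a sketch in Python (an assumed engine, with illustrative names and a made-up test string) comparing a pattern without the literal dot against the one that requires it:

```python
import re

# Hypothetical input: two words that merely end in "net"/"com", plus one real domain.
TEXT = ">clarinet >romcom >thatdomain.net"

# No literal dot required: any text ending in com/net after a ">" matches.
LOOSE = re.compile(r"(?<=>).*?(?:com|net)")
# Literal dot required before the extension.
STRICT = re.compile(r"(?<=>)\S+\.(?:com|net)")

print(LOOSE.findall(TEXT))   # ['clarinet', 'romcom', 'thatdomain.net']
print(STRICT.findall(TEXT))  # ['thatdomain.net']
```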
Without using lookarounds, you can also match the > and use a capture group for the domain:
>(\S+\.(?:com|net))
See another regex demo.
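A minimal sketch of the capture-group variant, again assuming Python; the helper name and the font-tag sample are made up for illustration. With a single capture group, re.findall returns the group contents rather than the whole match, so the leading > is dropped:

```python
import re

# ">" is matched literally; the domain itself is captured in group 1.
DOMAIN_AFTER_GT = re.compile(r">(\S+\.(?:com|net))")

def extract_domains(text):
    # With exactly one capture group, findall returns group 1 of each match.
    return DOMAIN_AFTER_GT.findall(text)

print(extract_domains("thisdomain.com fake text >thatdomain.net"))
# ['thatdomain.net']
print(extract_domains('<font color="red">thatdomain.net</font>'))
# ['thatdomain.net']
```

This can be handy in engines or tools where lookbehind is not supported, since only a plain group is needed.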