Regex between two strings - include last string

Time:03-07

So I'm trying to pull all domains out of a text file. They may have a special character at the beginning (like a font tag).

What I have so far: (?<=>).*?(?=com|net)

Text I'm searching:

>thisdomain.com fake text >thatdomain.net

It is currently finding "thisdomain" and "thatdomain", but of course it's cutting off the domain extension. I've dug through regex docs for about an hour and can't find a way to match between the > and .com without cutting off the .com. Any suggestions?

CodePudding user response:

Use

(?<=>).*?(?:com|net)

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    com                      'com'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    net                      'net'
--------------------------------------------------------------------------------
  )                        end of grouping
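To see the difference in practice, here is a minimal Python sketch comparing the lookahead from the question with the non-capturing group above. It assumes the sample text has a > directly before each domain; the variable names are only for illustration.

```python
import re

# Assumed sample text: each domain is preceded by a '>'.
text = ">thisdomain.com fake text >thatdomain.net"

# Pattern from the question: the lookahead only asserts 'com' or 'net',
# so the extension is never part of the match.
lookahead = re.findall(r"(?<=>).*?(?=com|net)", text)

# Pattern from this answer: the non-capturing group consumes the extension.
consuming = re.findall(r"(?<=>).*?(?:com|net)", text)

print(lookahead)  # ['thisdomain.', 'thatdomain.']
print(consuming)  # ['thisdomain.com', 'thatdomain.net']
```

Note that the lookahead version stops right before com or net, so the dot is still included in its matches.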

CodePudding user response:

The reason .com is missing from your match is that (?=com|net) is a lookahead assertion, which is non-consuming: it checks that com or net follows but does not add it to the match.

Instead, you can match either .com or .net including the dot, which prevents matching, for example, >clarinet or >romcom

(?<=>)\S+\.(?:com|net)

The pattern matches:

  • (?<= Positive lookbehind, assert what is directly to the left of the current position
    • > Match > literally
  • ) Close the lookbehind
  • \S+ Match 1 or more non-whitespace characters
  • \.(?:com|net) Match either .com or .net

See a regex demo.
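As a rough check of the clarinet and romcom point, this Python sketch runs a looser pattern without the dot next to the stricter pattern with the dot. The sample string and the names loose and strict are made up for illustration.

```python
import re

# Looser pattern: anything after '>' up to the first 'com' or 'net'.
loose = re.compile(r"(?<=>).*?(?:com|net)")

# Stricter pattern: non-whitespace characters, a literal dot, then com or net.
strict = re.compile(r"(?<=>)\S+\.(?:com|net)")

# Hypothetical input with two words that merely end in 'net' or 'com'.
sample = ">clarinet >romcom >thatdomain.net"

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # ['clarinet', 'romcom', 'thatdomain.net']
print(strict_hits)  # ['thatdomain.net']
```

Because \S+ cannot cross whitespace, the stricter pattern also keeps a match from running on into the surrounding text.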

Without using lookarounds, you can also match the > and put the domain in a capture group:

>(\S+\.(?:com|net))

See another regex demo.
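In Python, the capture group version could be used roughly like this. When the pattern has exactly one capture group, re.findall returns the group contents rather than the whole match, so the leading > is dropped. The sample text is assumed to have a > before each domain.

```python
import re

# Assumed sample text: each domain is preceded by a '>'.
text = ">thisdomain.com fake text >thatdomain.net"

# The '>' is matched but sits outside the group, so only the domain is returned.
domains = re.findall(r">(\S+\.(?:com|net))", text)

print(domains)  # ['thisdomain.com', 'thatdomain.net']
```

This form can be handy in regex engines that do not support lookbehind.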
