Remove every website ending with .com except one using regex

Time:12-02

I'm currently struggling with regex. I'm trying to remove every website ending in ".com" except one, "crypto.com", because it is not only a website but also the name of a cryptocurrency.

Let's take this sentence:

"Here are my favorite things: crypto.com, polo.com, cryp.com and google.com"

Inspired by this answer, this is my Python regex:

r"(\w+\.)?crypto\.com"

The problem, testing it on https://regex101.com, is that it captures only crypto.com and not the others, whereas I want to capture the others and leave crypto.com alone.

Can anyone tell me how to proceed? Thank you!

Expected code:

import re
text = "Here are my favorite things: crypto.com, polo.com, cryp.com and google.com"
text = re.sub(r"(\w+\.)?crypto\.com", '', text)

Expected output:

"Here are my favorite things: crypto.com,, and "

CodePudding user response:

You can use

\s*\b(?!crypto\.)\w+\.com\b

See the regex demo. Details:

  • \s* - zero or more whitespaces
  • \b - a word boundary
  • (?!crypto\.) - a negative lookahead that fails the match if the string crypto. appears immediately to the right of the current location
  • \w+ - one or more word chars
  • \.com - .com
  • \b - a word boundary.

See the Python demo:

import re
text = "Here are my favorite things: crypto.com, polo.com, cryp.com and google.com"
print(re.sub(r'\s*\b(?!crypto\.)\w+\.com\b', '', text))
# => Here are my favorite things: crypto.com,, and

A more comprehensive regex can also be used to remove commas and the word and:

(?:\s*(?:,|and\s*)?)\b(?!crypto\.)\w+\.com,?

See this regex demo.
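
As a minimal sketch of how this broader pattern behaves on the sample sentence from the question (the variable names `pattern` and `cleaned` are illustrative and not part of the original answer):

```python
import re

text = "Here are my favorite things: crypto.com, polo.com, cryp.com and google.com"

# Optional leading whitespace, then an optional "," or "and", then any
# word.com that is not crypto.com, then an optional trailing comma.
pattern = r"(?:\s*(?:,|and\s*)?)\b(?!crypto\.)\w+\.com,?"

cleaned = re.sub(pattern, "", text)
print(cleaned)
# => Here are my favorite things: crypto.com,
```

The comma right after crypto.com survives because it is followed by a space rather than directly by a domain name. If that is unwanted, something like cleaned.rstrip(',') would probably be enough for this input.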

CodePudding user response:

Use a negative look-around:

(\w+)?(?<!crypto)\.com

Edit: The question changed slightly, so I removed a \. that was incorrect. It should work now!
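
A quick way to check the lookbehind pattern is to run it on the sample sentence from the question. This is only a sketch; the names `pattern` and `cleaned` are illustrative. Since this pattern does not consume the whitespace before each match, the leftover spacing differs from the expected output in the question.

```python
import re

text = "Here are my favorite things: crypto.com, polo.com, cryp.com and google.com"

# The lookbehind is checked right before ".com": the match is rejected
# when the six characters to the left are exactly "crypto".
pattern = r"(\w+)?(?<!crypto)\.com"

cleaned = re.sub(pattern, "", text)
print(repr(cleaned))
# => 'Here are my favorite things: crypto.com, ,  and '
```

Note that the lookbehind only inspects the characters directly before .com, so a name that merely ends in "crypto", such as mycrypto.com (a made-up example), would also be kept.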
