How to remove URLs (beginning with brackets) with Python's regex

How can I remove all the URLs in this string?

sentence = "There's a website called Swagbucks [https://www.swagbucks.com/p/register?rb=60437228](https://www.swagbucks.com/p/register?rb=60437228) which gives you the possibility to earn a few hundred bucks. Also the new website is [https://www.swagbucks.com/p/register?rb=60437228]."

The result should be:

sentence = "There's a website called Swagbucks which gives you the possibility to earn a few hundred bucks. Also the new website is ."

CodePudding user response：

Here is the "easy" answer:

import re

def remove_bracketed_urls(sentence: str) -> str:
    return re.sub(r'\[[^]]*]', '', sentence)

It assumes there's no close square bracket in your URLs, i.e. that any such bracket is percent-escaped as %5D.

The character class in there matches any char which is not a close bracket, and the star repeats it zero or more times. So we look for open, then stuff, then close, and nuke it.

To remove parenthesized URLs, you'll want this regex:

r'\([^)]*\)'

Same concept -- it looks for open paren, stuff, close paren.
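To see the two patterns working together, here is a minimal sketch that applies them one after the other to the sentence from the question. The helper name remove_bracketed_and_parenthesized is made up for illustration. Be aware that it removes every bracketed or parenthesized span, whether or not it holds a URL.

```python
import re

sentence = (
    "There's a website called Swagbucks "
    "[https://www.swagbucks.com/p/register?rb=60437228]"
    "(https://www.swagbucks.com/p/register?rb=60437228) "
    "which gives you the possibility to earn a few hundred bucks. "
    "Also the new website is "
    "[https://www.swagbucks.com/p/register?rb=60437228]."
)

def remove_bracketed_and_parenthesized(text: str) -> str:
    # Hypothetical helper: drop [...] spans first, then (...) spans.
    text = re.sub(r'\[[^]]*]', '', text)
    return re.sub(r'\([^)]*\)', '', text)

cleaned = remove_bracketed_and_parenthesized(sentence)
print(cleaned)
```

The blank on each side of the removed link survives, so the output has a doubled blank after "Swagbucks".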

Here's an answer to a harder version of your question:

Assume that random junk sometimes appears within square brackets, and we wish to evaluate whether that junk corresponds to a URL. Let's iterate over each candidate.

import re
from urllib.parse import urlparse

words = re.split(r'[[\]]', sentence)
for word in words:
    result = urlparse(word)
    ...

The rest is up to you. Implement your own rules for how to decide if you got a good parse result or not, and filter them out as desired.

The regex simply looks for brackets, either kind, open or close.

The corresponding expression for parens is simpler; it needs no escapes:

r'[()]'
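One possible way to fill in the "rest is up to you" part is sketched below. It swaps the re.split loop for re.sub with a replacement function, so that only brackets whose content parses as a URL get removed. The rule used here (an http or https scheme plus a non-empty host) is an assumption, and the names looks_like_url and remove_bracketed_urls_only are invented for this example; adjust the rule to whatever counts as a good parse result for you.

```python
import re
from urllib.parse import urlparse

def looks_like_url(candidate: str) -> bool:
    # Assumed rule: an http(s) scheme plus a non-empty host counts as a URL.
    try:
        parts = urlparse(candidate.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def remove_bracketed_urls_only(text: str) -> str:
    # Remove a [...] span only when its content passes looks_like_url;
    # otherwise put the matched text back unchanged.
    def replace(match: re.Match) -> str:
        return '' if looks_like_url(match.group(1)) else match.group(0)
    return re.sub(r'\[([^]]*)]', replace, text)

sample = "See [note 1] and [https://www.swagbucks.com/p/register?rb=60437228]."
print(remove_bracketed_urls_only(sample))
```

With this rule, "[note 1]" is kept because it has no scheme, while the bracketed URL is dropped.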

parsing: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse
escaping: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote_plus
regex character class: https://docs.python.org/3/howto/regex.html#matching-characters
"raw" r-strings: https://docs.python.org/3/library/re.html#raw-string-notation

You might be annoyed by doubled blanks in your output. Consider cleaning them up with this:

def normalize_blanks(s: str) -> str:
    words = s.split()
    return ' '.join(words)

The split on whitespace will treat a run of multiple blanks the same as a single blank.
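A quick check of that behaviour, as a small sketch. Note that split() with no arguments splits on any whitespace, so tabs and newlines are collapsed too, and leading or trailing whitespace disappears. That may or may not be what you want for multi-line text.

```python
def normalize_blanks(s: str) -> str:
    # split() with no arguments splits on runs of any whitespace.
    return ' '.join(s.split())

doubled = "There's a website called Swagbucks  which gives you the possibility to earn a few hundred bucks."
print(normalize_blanks(doubled))

# Tabs, newlines and surrounding whitespace are normalized as well.
print(repr(normalize_blanks("  a\tb \n c  ")))
```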

CodePudding user response：

A possible solution is the following:

import re

string = """There's a website called Swagbucks [https://www.swagbucks.com/p/register?rb=60437228](https://www.swagbucks.com/p/register?rb=60437228) which gives you the possibility to earn a few hundred bucks. Also the new website is [https://www.swagbucks.com/p/register?rb=60437228]."""

result = re.sub('[[(].*?[])]', '', string)

print(result)

Prints

There's a website called Swagbucks  which gives you the possibility to earn a few hundred bucks. Also the new website is .

Regex explanation:

[[(] - [ or (
.*? - any character (except newline), 0 or more times, as few as possible (lazy)
[])] - ] or )
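The lazy quantifier matters here. The sketch below compares it with the greedy form on a made-up string. The brackets inside the character classes are escaped for readability only; the classes match the same characters as in the pattern above.

```python
import re

text = "keep [drop this] keep (and this) keep"

# Lazy: each match stops at the first closing bracket or paren.
lazy = re.sub(r'[\[(].*?[\])]', '', text)

# Greedy: one match runs from the first opener to the last closer.
greedy = re.sub(r'[\[(].*[\])]', '', text)

print(repr(lazy))    # 'keep  keep  keep'
print(repr(greedy))  # 'keep  keep'

# Caveat: the pattern does not pair bracket types, so "[a)" also matches.
mismatched = re.sub(r'[\[(].*?[\])]', '', "x [a) y")
print(repr(mismatched))  # 'x  y'
```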


CodePudding user response：

The Markdown links you are asking about have the form

[link text](http://example.url/)

and so your question should properly be, "how can I remove the text inside the square brackets, and the brackets and the parenthesized URL link immediately after?"

Try this:

re.sub(r'\s?\[[^][]+]\([^()<>\s]+\)', '', text)

Or, if you wanted to preserve the link text,

re.sub(r'\[([^][]+)]\([^<>()\s]+\)', r'\1', text)

This does not properly cope with link text that contains literal square brackets, or with URLs that contain literal round parentheses or whitespace. Nor does it attempt to handle the alternate Markdown link syntax:

[link text][1]

  [1]: http://example.url/

To also cope with the last URL in your example, which isn't valid Markdown, make the part with the round parentheses optional:

re.sub(r'\s?\[[^][]+](\([^()<>\s]+\))?', '', text)
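As a rough check, here is how these patterns behave on the sentence from the question, assuming a + quantifier after each character class. The names LINK_OR_BRACKETED and LINK_KEEP_TEXT are made up for this sketch.

```python
import re

text = (
    "There's a website called Swagbucks "
    "[https://www.swagbucks.com/p/register?rb=60437228]"
    "(https://www.swagbucks.com/p/register?rb=60437228) "
    "which gives you the possibility to earn a few hundred bucks. "
    "Also the new website is "
    "[https://www.swagbucks.com/p/register?rb=60437228]."
)

# [text] optionally followed by (url), plus one preceding blank if present.
LINK_OR_BRACKETED = re.compile(r'\s?\[[^][]+](\([^()<>\s]+\))?')
removed = LINK_OR_BRACKETED.sub('', text)
print(removed)

# Variant that keeps the link text. There is no leading \s? here,
# so the blank in front of the link text survives.
LINK_KEEP_TEXT = re.compile(r'\[([^][]+)]\([^<>()\s]+\)')
kept = LINK_KEEP_TEXT.sub(r'\1', "Visit [link text](http://example.url/) today.")
print(kept)
```

Because the pattern also eats the blank before the opening bracket, the first result has no doubled blank, and the second sentence ends in "is." rather than "is .".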