Remove a specific URL from data frame

In Python I have a dataframe in which one column contains two comma-separated URLs (https://pippo.it, https://pluto.it), and another column stores the URLs I want to remove from the whole dataframe. How do I accomplish this?

Example code

df = pd.DataFrame({'urls':['https://pippo.it, https://pluto.it', 'http://blah.com'], 'urls2':['http://blah.net', 'https://pippo.it, https://pluto.it']})
df2 = df

for column in df:

    URLVal = df["url2"].values
    df2 = df2.replace(str(URLVal.values), "")

CodePudding user response：

You may try replacing using a regex:

df["urls"] = df["urls"].str.replace(r'(?:,\s*)?https?://pluto\.it(?:,\s*)?', '', regex=True)

Here is a regex demo showing that the replacement logic is working.
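The same check can be reproduced locally. The following is a minimal sketch, assuming the sample dataframe from the question and the pattern above, here with the dot in pluto.it escaped; the names pattern and cleaned are only illustrative.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "urls": ["https://pippo.it, https://pluto.it", "http://blah.com"],
    "urls2": ["http://blah.net", "https://pippo.it, https://pluto.it"],
})

# Optional comma/whitespace on either side of the URL to remove
pattern = r"(?:,\s*)?https?://pluto\.it(?:,\s*)?"

# Series.str.replace works on substrings of each cell in one column
cleaned = df["urls"].str.replace(pattern, "", regex=True)
print(cleaned.tolist())  # ['https://pippo.it', 'http://blah.com']
```

Note that this only touches the urls column; the urls2 column would need the same call.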

CodePudding user response：

I want to explain why the attempt

df.replace("https://pluto.it", "")

failed. pandas has two distinct methods: pandas.DataFrame.replace (which, without regex=True, replaces only cells whose whole value matches) and pandas.Series.str.replace (akin to the replace method of str; note that it applies to a Series, i.e. a single column of a DataFrame). Consider the following simple example:

import pandas as pd
df = pd.DataFrame({"A":["ABC","ABCDEF","DEF"]})
print(df.replace("DEF","X"))  # holds ABC, ABCDEF, X
print(df.A.str.replace("DEF","X"))  # holds ABC, ABCX, X
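As a small follow-up sketch on the same toy frame: DataFrame.replace can also do substring replacement once regex=True is passed, which is what the regex-based answers in this thread rely on. The variable names whole_cell and substring are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["ABC", "ABCDEF", "DEF"]})

# Default behaviour: only cells that are exactly "DEF" are replaced
whole_cell = df.replace("DEF", "X")

# With regex=True the pattern is substituted wherever it occurs inside a cell
substring = df.replace("DEF", "X", regex=True)

print(whole_cell["A"].tolist())  # ['ABC', 'ABCDEF', 'X']
print(substring["A"].tolist())   # ['ABC', 'ABCX', 'X']
```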

CodePudding user response：

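You can use either of the following:

df = df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True)  # returns a new dataframe
# or, equivalently, modify the dataframe in place:
df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True, inplace=True)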

See a Pandas test:

import pandas as pd
df = pd.DataFrame({'urls':['https://pippo.it, https://pluto.it', 'http://blah.com'], 'urls2':['http://blah.net', 'https://pippo.it, https://pluto.it']})
df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True, inplace=True)

So, if the initial dataframe looks like

                                 urls                               urls2
0  https://pippo.it, https://pluto.it                     http://blah.net
1                     http://blah.com  https://pippo.it, https://pluto.it

The output will be:

               urls             urls2
0  https://pippo.it   http://blah.net
1   http://blah.com  https://pippo.it

Passing inplace=True applies the changes directly to the dataframe, so there is no need to reassign the variable.

The \s*(?:,\s*)?https://pluto\.it\b regex works as follows:

\s* - zero or more whitespace characters
(?:,\s*)? - an optional sequence of a comma and then zero or more whitespace characters
https://pluto\.it - a literal https://pluto.it string
\b - a word boundary (used to match it but not ita, it0, it_ etc.). Note: If you want to require the end of the string after the URL, use $. If you want to allow only whitespace or the end of the string after the URL, use (?!\S) instead of \b.
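The difference between the three endings can be seen with plain re.sub. This is only a sketch: the sample strings with /page and .italy are made up for illustration and do not come from the question, and strip_url is a hypothetical helper.

```python
import re

# Common part of the pattern; only the ending is varied below
prefix = r"\s*(?:,\s*)?https://pluto\.it"

samples = [
    "https://pippo.it, https://pluto.it",
    "https://pippo.it, https://pluto.it/page",  # hypothetical: URL followed by a path
    "https://pippo.it, https://pluto.italy",    # hypothetical: longer domain
]

def strip_url(suffix):
    """Apply prefix + suffix to every sample and return the results."""
    return [re.sub(prefix + suffix, "", s) for s in samples]

with_boundary = strip_url(r"\b")        # "/" counts as a boundary, "a" does not
with_lookahead = strip_url(r"(?!\S)")   # only whitespace or end of string may follow
with_end = strip_url(r"$")              # only end of string may follow

print(with_boundary)
print(with_lookahead)
print(with_end)
```

With \b the second sample is reduced to https://pippo.it/page, because the slash after it counts as a word boundary; the other two endings leave that sample untouched.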

CodePudding user response：

Assuming the string to be replaced always has exactly the same structure, you could also do it very easily without a regex. Note that the result has to be assigned back, since replace returns a new dataframe:

df = df.replace("https://pippo.it, https://pluto.it", "https://pippo.it")
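If the unwanted URL can appear at any position in the list, a regex-free variant is to split each cell on commas, drop the unwanted entry, and join the rest again. This is a different technique from the whole-cell replace above, sketched under the assumption that every cell is a string; unwanted and drop_url are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "urls": ["https://pippo.it, https://pluto.it", "http://blah.com"],
    "urls2": ["http://blah.net", "https://pippo.it, https://pluto.it"],
})

unwanted = "https://pluto.it"  # assumption: a single URL to remove

def drop_url(cell):
    """Split a comma-separated cell, drop the unwanted URL, and rejoin."""
    parts = [part.strip() for part in cell.split(",")]
    return ", ".join(part for part in parts if part != unwanted)

# Apply the helper to every cell, column by column
cleaned = df.apply(lambda col: col.map(drop_url))
print(cleaned)
```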