In Python I have a dataframe in which one column contains two comma-separated URLs (https://pippo.it, https://pluto.it), and another column stores the URLs I want to remove from the whole dataframe. How do I accomplish this?
Example code
import pandas as pd
df = pd.DataFrame({'urls':['https://pippo.it, https://pluto.it', 'http://blah.com'], 'urls2':['http://blah.net', 'https://pippo.it, https://pluto.it']})
df2 = df
for column in df:
    URLVal = df["urls2"].values  # the column is named "urls2", not "url2"
    df2 = df2.replace(str(URLVal), "")  # URLVal is already an array, so no second .values
CodePudding user response:
You may try replacing using a regex:
df["urls"] = df["urls"].str.replace(r'(?:,\s*)?https?://pluto\.it(?:,\s*)?', '', regex=True)
Here is a regex demo showing that the replacement logic is working.
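As a quick local check, here is a minimal sketch that applies the pattern to the sample data from the question. The dot in pluto.it is escaped so that it only matches a literal dot, and the names pattern and cleaned are only illustrative. Note that this handles just the urls column.

```python
import pandas as pd

df = pd.DataFrame({
    "urls": ["https://pippo.it, https://pluto.it", "http://blah.com"],
    "urls2": ["http://blah.net", "https://pippo.it, https://pluto.it"],
})

# Optional comma plus whitespace on either side of the URL to remove.
pattern = r"(?:,\s*)?https?://pluto\.it(?:,\s*)?"

# Series.str.replace works on one column at a time.
cleaned = df["urls"].str.replace(pattern, "", regex=True)
print(cleaned.tolist())  # ['https://pippo.it', 'http://blah.com']
```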
CodePudding user response:
I want to explain why the attempted solution df.replace("https://pluto.it", "") failed. pandas has two distinct methods: pandas.DataFrame.replace (for replacing whole elements) and pandas.Series.str.replace (akin to the replace method of str; note that it applies to a Series, i.e. a single column of a DataFrame). Consider the following simple example:
import pandas as pd
df = pd.DataFrame({"A":["ABC","ABCDEF","DEF"]})
print(df.replace("DEF","X")) # holds ABC, ABCDEF, X
print(df.A.str.replace("DEF","X")) # holds ABC, ABCX, X
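It may also help to know that DataFrame.replace switches to substring replacement when it is given regex=True, which is why a regex-based df.replace call can remove part of a cell. A minimal sketch on the same frame (the names whole and partial are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": ["ABC", "ABCDEF", "DEF"]})

# Default behaviour: only cells that are exactly "DEF" are replaced.
whole = df.replace("DEF", "X")

# With regex=True the pattern is applied inside each string cell.
partial = df.replace("DEF", "X", regex=True)

print(whole["A"].tolist())    # ['ABC', 'ABCDEF', 'X']
print(partial["A"].tolist())  # ['ABC', 'ABCX', 'X']
```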
CodePudding user response:
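You can use either of the following (the first reassigns the result, the second modifies the dataframe in place):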
df = df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True)
df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True, inplace=True)
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'urls':['https://pippo.it, https://pluto.it', 'http://blah.com'], 'urls2':['http://blah.net', 'https://pippo.it, https://pluto.it']})
df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True, inplace=True)
So, if the initial dataframe looks like
                                 urls                               urls2
0  https://pippo.it, https://pluto.it                     http://blah.net
1                     http://blah.com  https://pippo.it, https://pluto.it
The output will be:
               urls             urls2
0  https://pippo.it   http://blah.net
1   http://blah.com  https://pippo.it
The inplace=True argument applies the changes directly to the dataframe, so there is no need to reassign the variable.
The \s*(?:,\s*)?https://pluto\.it\b regex needs more attention:
\s* - zero or more whitespaces
(?:,\s*)? - an optional sequence of a comma and then zero or more whitespaces
https://pluto\.it - a literal https://pluto.it string
\b - a word boundary (used to match it but not ita, it0, it_, etc.).
Note: If you want to make sure there is end of string, use $. If you want to make sure there can be a whitespace or end of string only after the URL, use (?!\S) instead of \b.
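To make the difference between the three endings concrete, here is a small sketch that uses Python's re module directly. The sample strings are made up for illustration and are not from the question.

```python
import re

# Common part of the pattern; only the ending after the URL varies.
PREFIX = r"\s*(?:,\s*)?https://pluto\.it"

samples = [
    "https://pippo.it, https://pluto.it",  # URL at the very end
    "https://pluto.it/page",               # URL followed by a path
    "https://pluto.italy",                 # a different, longer host name
    "https://pluto.it is down",            # URL followed by whitespace
]

results = {}
for ending in (r"\b", r"$", r"(?!\S)"):
    results[ending] = [re.sub(PREFIX + ending, "", s) for s in samples]
    print(f"{ending:8}", results[ending])
```

With \b the URL is also stripped out of https://pluto.it/page (leaving /page), because / is not a word character. $ only matches when the URL ends the string, and (?!\S) additionally accepts a following whitespace, so it is the only stricter variant that still cleans the last sample.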
CodePudding user response:
Assuming the string element to be replaced always has exactly the same content, you could also do it very easily and without regex as follows (note that replace returns a new dataframe, so the result has to be assigned):
df = df.replace("https://pippo.it, https://pluto.it", "https://pippo.it")
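If the cells do not always have exactly that content, another regex-free option is to split each cell on commas, drop the unwanted URLs, and join what is left. This is only a sketch: drop_urls and unwanted are hypothetical names, and it assumes every cell in the dataframe is a string.

```python
import pandas as pd


def drop_urls(cell, unwanted):
    """Remove every URL listed in `unwanted` from a comma-separated cell."""
    parts = [part.strip() for part in cell.split(",")]
    return ", ".join(part for part in parts if part not in unwanted)


df = pd.DataFrame({
    "urls": ["https://pippo.it, https://pluto.it", "http://blah.com"],
    "urls2": ["http://blah.net", "https://pippo.it, https://pluto.it"],
})

# Assumed set of URLs to remove; it could be built from another column instead.
unwanted = {"https://pluto.it"}

# apply visits each column, Series.map applies the helper to each cell.
cleaned = df.apply(lambda col: col.map(lambda cell: drop_urls(cell, unwanted)))
print(cleaned)
```

Because the comparison is on whole URLs, a longer address such as https://pluto.it/page would be left untouched.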