I have a column in a pandas dataframe that holds various URLs to websites:
df:
   ID                            URL
0   1    https://www.Facebook.com/fr
1   2         https://Twitter.com/de
2   3        https://www.Youtube.com
3   4              www.Microsoft.com
4   5  https://www.Stackovervlow.com
I am using urlparse().netloc to reduce each URL to its domain name (e.g., from https://www.Facebook.com/fr to www.Facebook.com). Some of the URLs are already in a clean format (www.Microsoft.com above), and applying urlparse().netloc to these clean URLs results in an empty cell. Therefore, I am trying to apply urlparse().netloc only to elements of the URL column that contain the string 'http', and return the original URL otherwise. Here is the code I have been trying to use:
df['URL'] = df['URL'].apply(
    lambda x: urlparse(x).netloc if x.str.contains("http", na=False) else x
)
However, I get this error message: AttributeError: 'str' object has no attribute 'str'. Any help on how I can overcome this to complete the task would be much appreciated!
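A quick check of the behaviour described above, using only the standard library. This is a minimal sketch; the variable names are just for illustration. Without a scheme or a leading //, urlparse treats the whole string as a path, which is why netloc comes back empty.

```python
from urllib.parse import urlparse

# URL with a scheme: the host ends up in netloc
with_scheme = urlparse("https://www.Facebook.com/fr")

# URL without a scheme: nothing is recognised as a network location
without_scheme = urlparse("www.Microsoft.com")

print(repr(with_scheme.netloc))
print(repr(without_scheme.netloc))
print(repr(without_scheme.path))
```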
CodePudding user response:
You are using pandas.Series.apply, so your function (the lambda) receives each element (a str) itself rather than the Series. You can simply use the in operator, as follows:
from urllib.parse import urlparse

# each x is a plain str, so a substring test with `in` is enough
df['URL'] = df['URL'].apply(
    lambda x: urlparse(x).netloc if "http" in x else x
)
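The same idea can be written as a named function, which is easier to test on its own. This is only a sketch: the helper name to_netloc is hypothetical, the list mirrors the URL column from the question, and checking with startswith instead of in is an assumption about the intent (it avoids matching "http" in the middle of a string).

```python
from urllib.parse import urlparse

def to_netloc(url):
    """Return the netloc for http(s) URLs; return anything else unchanged."""
    if url.startswith(("http://", "https://")):
        return urlparse(url).netloc
    return url

# Sample values copied from the question's URL column
urls = [
    "https://www.Facebook.com/fr",
    "https://Twitter.com/de",
    "https://www.Youtube.com",
    "www.Microsoft.com",
    "https://www.Stackovervlow.com",
]

cleaned = [to_netloc(u) for u in urls]
print(cleaned)
```

With the DataFrame from the question, this would be applied as df['URL'] = df['URL'].apply(to_netloc).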
CodePudding user response:
x is already a string, not a Series, so use x.find:
df['URL'] = df['URL'].apply(
    lambda x: urlparse(x).netloc if x.find("http") != -1 else x
)
print(df)
# Output:
   ID                    URL
0   1       www.Facebook.com
1   2            Twitter.com
2   3        www.Youtube.com
3   4      www.Microsoft.com
4   5  www.Stackovervlow.com
Alternatively, you can use str.extract to get the netloc without urlparse:
df['URL'] = df['URL'].str.extract(r'(?:^https?://)?([^/]+)', expand=False)
print(df)
# Output:
   ID                    URL
0   1       www.Facebook.com
1   2            Twitter.com
2   3        www.Youtube.com
3   4      www.Microsoft.com
4   5  www.Stackovervlow.com
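The pattern can also be checked outside pandas with the standard re module. This is a rough sketch, assuming the capture group is meant to read ([^/]+), i.e. one or more non-slash characters; the names NETLOC_RE and first_netloc are hypothetical. Like str.extract, it takes the first match found in the string.

```python
import re

# Optional http:// or https:// prefix, then everything up to the next slash
NETLOC_RE = re.compile(r'(?:^https?://)?([^/]+)')

def first_netloc(url):
    """Return the first captured group, or None if nothing matches."""
    match = NETLOC_RE.search(url)
    return match.group(1) if match else None

for url in ["https://www.Facebook.com/fr", "https://Twitter.com/de", "www.Microsoft.com"]:
    print(url, "->", first_netloc(url))
```

Note that the pattern only knows about http and https. For another scheme such as ftp://example.com/a, the optional prefix is skipped and the group captures "ftp:", so the urlparse approach is the safer choice when the column may contain other schemes.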