How to remove string after .com and "https://" from an URL in Python-CodePudding

I need to merge two dataframe by using url as a primary key. However, there are some extra strings in the url like in df1, I have https://www.mcdonalds.com/us/en-us.html, where in df2, I have https://www.mcdonalds.com

I need to remove the /us/en-us.html after the .com and the https:// from the url, so I can perform the merge using url between 2 dfs. Below is a simplified example. What would be the solution for this?

df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your- 
location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}

df1['url']==df2['url']
Out[7]: False

Thanks.

CodePudding user response：

URLs are not trivial to parse. Take a look at the urllib module in the standard library.

Here's how you could remove the path after the domain:

import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

df1['url'] = df1['url'].apply(remove_path)

CodePudding user response：

Use urlparse and isolate the hostname:

from urllib.parse import urlparse

urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'

CodePudding user response：

You can use urlparse as suggested by others, or you could also use urlsplit. However, both will not handle www.cemexusa.com. So if you do not need the scheme in your key, you could use something like this:

def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://"", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)

Here is a full working example:

import pandas as pd
import io

from urllib.parse import urlsplit

df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")

df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")

df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)

def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://"", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
    
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)

joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))

# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)

The output of print(joined) would be:

  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020

There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:

urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com

urlsplit("https://www.realpython.com").hostname  # also a valid URL
# www.realpython.com

What is the difference between urlparse and urlsplit?

It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.

[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit