I am writing code to extract the main URL from a full URL. For example, if the input is https://www.google.com/example/example.html, I need the output to be https://www.google.com or www.google.com.
I tried using regex, but it didn't work, and slicing doesn't work because there are too many slashes. Please help.
Note: Please give the answer in Python.
CodePudding user response:
Try using urlparse.
from urllib.parse import urlparse
long_url = "https://www.google.com/example/example.html"
# Parse the long_url using the urlparse module
parsed_url = urlparse(long_url)
# Extract the scheme and hostname from the parsed_url
main_url = parsed_url.scheme + "://" + parsed_url.hostname
# Print the main_url
print(main_url)
Or if you only want the hostname without the scheme, just use hostname in the parsed_url:
from urllib.parse import urlparse
long_url = "https://www.google.com/example/example.html"
# Parse the long_url using the urlparse module
parsed_url = urlparse(long_url)
# Extract the hostname from the parsed_url
hostname = parsed_url.hostname
# Print the hostname
print(hostname)
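One detail worth knowing when choosing between attributes: `hostname` is lowercased and drops any port, while `netloc` keeps the authority part as written (including a port, and any user info if present). The sketch below illustrates the difference; the URL with a port and the helper name `origin_of` are made up for illustration.

```python
from urllib.parse import urlparse


def origin_of(url):
    """Return scheme://netloc, keeping any port (hypothetical helper)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


# Made-up URL with mixed case and an explicit port
example = "https://Example.com:8443/a/b.html"
parts = urlparse(example)

print(parts.hostname)      # host only, lowercased, no port
print(parts.netloc)        # host and port exactly as written
print(origin_of(example))  # scheme plus netloc
```

If the port matters for your use case, `netloc` is probably the better choice; otherwise `hostname` gives a cleaner result.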
CodePudding user response:
To extract the main URL from a URL string in Python, you can use the urlparse() function from the urllib.parse module. It parses a URL string and returns a ParseResult object containing the parts of the URL, including the scheme, hostname, and path. Combining the scheme and hostname gives the main URL.
Here is an example:
from urllib.parse import urlparse
# Function to extract the main URL from a given URL string
def extract_main_url(url):
    # Parse the URL string using the urlparse function
    parsed_url = urlparse(url)
    # Extract the scheme and hostname parts from the parsed URL
    scheme = parsed_url.scheme
    hostname = parsed_url.hostname
    # Combine the scheme and hostname parts to form the main URL
    main_url = scheme + "://" + hostname
    return main_url
# Test the extract_main_url function with a few different URLs
print(extract_main_url("https://www.google.com/example/example.html")) # Output: https://www.google.com
print(extract_main_url("https://www.google.com/search?q=query")) # Output: https://www.google.com
print(extract_main_url("https://github.com/user/repo")) # Output: https://github.com
In this example, the extract_main_url function takes a URL string as its argument, parses it with urlparse(), and returns the scheme and hostname joined by "://".
As the test calls show, it returns the main URL (i.e. the scheme and hostname parts) for any absolute URL that includes a scheme.
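If the input might not include a scheme (for example `www.google.com/example/example.html`), urlparse() treats the whole string as a path and `hostname` comes back as `None`. A minimal sketch of one way to handle that, assuming it is acceptable to fall back to `https`; the function name `extract_main_url_safe` and the `default_scheme` parameter are hypothetical, not part of urllib:

```python
from urllib.parse import urlparse


def extract_main_url_safe(url, default_scheme="https"):
    """Return scheme://hostname, tolerating input without a scheme.

    Hypothetical helper: assumes scheme-less input should be treated
    as if it started with default_scheme.
    """
    parsed = urlparse(url)
    if not parsed.netloc:
        # No authority part was found, so retry with a scheme prepended
        parsed = urlparse(f"{default_scheme}://{url}")
    if not parsed.hostname:
        raise ValueError(f"Could not find a hostname in {url!r}")
    # Inputs like //host/path have a netloc but an empty scheme
    scheme = parsed.scheme or default_scheme
    return f"{scheme}://{parsed.hostname}"


print(extract_main_url_safe("www.google.com/example/example.html"))
print(extract_main_url_safe("https://github.com/user/repo"))
```

This is only a sketch: it does not validate that the result is a real domain, and unusual inputs such as `mailto:` addresses may need separate handling.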