I'm making a project that takes Google search results via the googlesearch module and sorts them by top-level domain. I'll use COVID-19 as an example.
Input:
for search in googlesearch.search("COVID-19", lang='en'):
    print(search)
Output:
https://www.cdc.gov/coronavirus/2019-ncov/index.html
https://coronavirus.jhu.edu/map.html
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.who.int/health-topics/coronavirus
https://www.worldometers.info/coronavirus/
https://en.wikipedia.org/wiki/COVID-19
https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home
https://www.michigan.gov/coronavirus/
https://coronavirus.in.gov/
https://www.osha.gov/coronavirus
https://covid19.nj.gov/
So that part works. I can change the input to:
search_results = []
for search in googlesearch.search("COVID-19", lang='en'):
    search_results.append(search)
to have a list of the websites. Now, I want to sort them using this order:
[".gov/", ".int/", ".com/", ".edu/", ".org/", ".info/"]
I will probably change the order and/or add more domains later. So, I want the sorted version to be:
https://www.cdc.gov/coronavirus/2019-ncov/index.html
https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home
https://www.michigan.gov/coronavirus/
https://coronavirus.in.gov/
https://www.osha.gov/coronavirus
https://covid19.nj.gov/
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.who.int/health-topics/coronavirus
https://coronavirus.jhu.edu/map.html
https://en.wikipedia.org/wiki/COVID-19
https://www.worldometers.info/coronavirus/
Any idea on how I can do this?
CodePudding user response:
One approach would be to create a dictionary that maps each domain extension to a rank for sorting the URLs. Then call sorted with a lambda expression that extracts the domain extension from each URL and looks up its rank.
import re

# Lower rank sorts first
domains = {"gov": 1, "int": 2, "com": 3, "edu": 4, "org": 5, "info": 6}
urls = ['https://www.cdc.gov/coronavirus/2019-ncov/index.html',
'https://coronavirus.jhu.edu/map.html',
'https://www.who.int/emergencies/diseases/novel-coronavirus-2019',
'https://www.who.int/health-topics/coronavirus',
'https://www.worldometers.info/coronavirus/',
'https://en.wikipedia.org/wiki/COVID-19',
'https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home',
'https://www.michigan.gov/coronavirus/',
'https://coronavirus.in.gov/',
'https://www.osha.gov/coronavirus',
'https://covid19.nj.gov/']
# Capture the last dot-separated label of the host name and look up its rank
urls = sorted(urls, key=lambda x: domains[re.sub(r'^https?://[^/]+\.([^/]+)/.*$', r'\1', x)])
print(urls)
This prints:
['https://www.cdc.gov/coronavirus/2019-ncov/index.html', # .gov
'https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home', # .gov
'https://www.michigan.gov/coronavirus/', # .gov
'https://coronavirus.in.gov/', # .gov
'https://www.osha.gov/coronavirus', # .gov
'https://covid19.nj.gov/', # .gov
'https://www.who.int/emergencies/diseases/novel-coronavirus-2019', # .int
'https://www.who.int/health-topics/coronavirus', # .int
'https://coronavirus.jhu.edu/map.html', # .edu
'https://en.wikipedia.org/wiki/COVID-19', # .org
'https://www.worldometers.info/coronavirus/'] # .info
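Since you mention you may reorder the list or add domains later, here is a slightly more forgiving sketch of the same idea. It is only one possible variation: it assumes the order is kept as a plain list, pulls the host name out with urllib.parse instead of a regex, and sends any domain that is not in the list to the end rather than raising a KeyError. The names TLD_ORDER, tld_rank and sort_by_tld are made up for this example.

```python
from urllib.parse import urlparse

# Preferred order (hypothetical name); earlier entries sort first.
TLD_ORDER = ["gov", "int", "com", "edu", "org", "info"]
RANK = {tld: i for i, tld in enumerate(TLD_ORDER)}


def tld_rank(url):
    """Return the sort rank of a URL's top-level domain."""
    host = urlparse(url).hostname or ""
    # The top-level domain is the last dot-separated label of the host.
    tld = host.rsplit(".", 1)[-1]
    # Unlisted domains get a rank past the end of the list, so they sort last.
    return RANK.get(tld, len(TLD_ORDER))


def sort_by_tld(urls):
    """Sort URLs by TLD rank; sorted() is stable, so ties keep their order."""
    return sorted(urls, key=tld_rank)
```

You would call it as sort_by_tld(search_results). Because sorted is stable, URLs that share a domain stay in the order Google returned them, and changing the priority only means editing TLD_ORDER.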