Sort items in list by index of a substring in another list

Time:11-15

I'm making a project that takes Google search results via the googlesearch module and sorts them by top-level domain. I'll use COVID-19 as an example.

Input:

for search in googlesearch.search("COVID-19", lang='en'):
    print(search)

Output:

https://www.cdc.gov/coronavirus/2019-ncov/index.html
https://coronavirus.jhu.edu/map.html
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.who.int/health-topics/coronavirus
https://www.worldometers.info/coronavirus/
https://en.wikipedia.org/wiki/COVID-19
https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home
https://www.michigan.gov/coronavirus/
https://coronavirus.in.gov/
https://www.osha.gov/coronavirus
https://covid19.nj.gov/

So that part works. I can change the code to:

search_results = []
for search in googlesearch.search("COVID-19", lang='en'):
    search_results.append(search)

to have a list of the websites. Now, I want to sort them using this order:

[".gov/", ".int/", ".com/", ".edu/", ".org/", ".info/"]

I will probably change the order and/or add more domains later. So, I want the sorted version to be:

https://www.cdc.gov/coronavirus/2019-ncov/index.html
https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home
https://www.michigan.gov/coronavirus/
https://coronavirus.in.gov/
https://www.osha.gov/coronavirus
https://covid19.nj.gov/
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.who.int/health-topics/coronavirus
https://coronavirus.jhu.edu/map.html
https://en.wikipedia.org/wiki/COVID-19
https://www.worldometers.info/coronavirus/

Any idea on how I can do this?

Answer:

One approach would be to create a dictionary that maps each domain extension to a sort rank. Then call sorted with a lambda expression that extracts the domain extension from each URL and looks up its rank.

import re

domains = {"gov": 1, "int": 2, "com": 3, "edu": 4, "org": 5, "info": 6}
urls = ['https://www.cdc.gov/coronavirus/2019-ncov/index.html',
    'https://coronavirus.jhu.edu/map.html',
    'https://www.who.int/emergencies/diseases/novel-coronavirus-2019',
    'https://www.who.int/health-topics/coronavirus',
    'https://www.worldometers.info/coronavirus/',
    'https://en.wikipedia.org/wiki/COVID-19',
    'https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home',
    'https://www.michigan.gov/coronavirus/',
    'https://coronavirus.in.gov/',
    'https://www.osha.gov/coronavirus',
    'https://covid19.nj.gov/']
# Capture the last dot-separated part of the host name (the extension) and look up its rank.
urls = sorted(urls, key=lambda x: domains[re.sub(r'^https?://[^/]+\.([^/]+)/.*$', r'\1', x)])
print(urls)

This prints:

['https://www.cdc.gov/coronavirus/2019-ncov/index.html',             # .gov
 'https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home',        # .gov
 'https://www.michigan.gov/coronavirus/',                            # .gov
 'https://coronavirus.in.gov/',                                      # .gov
 'https://www.osha.gov/coronavirus',                                 # .gov
 'https://covid19.nj.gov/',                                          # .gov
 'https://www.who.int/emergencies/diseases/novel-coronavirus-2019',  # .int
 'https://www.who.int/health-topics/coronavirus',                    # .int
 'https://coronavirus.jhu.edu/map.html',                             # .edu
 'https://en.wikipedia.org/wiki/COVID-19',                           # .org
 'https://www.worldometers.info/coronavirus/']                       # .info
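
Note that the dictionary lookup above raises a KeyError for any extension that is not in domains, and the regex expects a slash after the host name. A possible variation, as a rough sketch only: build the ranks from the ordered list in the question and pull the host name out with urllib.parse, so that unknown extensions simply sort last. The names TLD_ORDER, tld_rank and sort_by_tld, and the example.xyz URL, are made up for illustration.

```python
from urllib.parse import urlparse

# Priority order from the question, written without the dots and slashes.
TLD_ORDER = ["gov", "int", "com", "edu", "org", "info"]
RANK = {tld: i for i, tld in enumerate(TLD_ORDER)}


def tld_rank(url):
    """Return the rank of the URL's top-level domain; unknown ones sort last."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]  # text after the last dot of the host name
    return RANK.get(tld, len(TLD_ORDER))


def sort_by_tld(urls):
    """Sort URLs by TLD rank; sorted() is stable, so ties keep their input order."""
    return sorted(urls, key=tld_rank)


# example.xyz is a hypothetical URL with an extension that is not in TLD_ORDER.
print(sort_by_tld([
    "https://example.xyz/",
    "https://en.wikipedia.org/wiki/COVID-19",
    "https://covid19.nj.gov/",
]))
# ['https://covid19.nj.gov/', 'https://en.wikipedia.org/wiki/COVID-19', 'https://example.xyz/']
```

Changing the order or adding more domains later then only means editing TLD_ORDER.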