Extracting one URL from each group of similar URLs

I have a CSV file containing thousands of URLs. How can I randomly select one URL for each base URL? The order of the results does not matter, but the selection within each group has to be random.

import pandas as pd

# initialise data of lists.
data = {'url':['https://alabamasymphony.org/event/shamrocks-strings', 
               'https://alabamasymphony.org/event/emperor', 
               'https://mobilesymphony.org/event/fanfare',
               'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
               'https://www.hso.org/concerts/liszt-fantasy/',
               'https://www.juneausymphony.org/apr2019/']}

# Create DataFrame
df = pd.DataFrame(data)
df

Expected output

['https://alabamasymphony.org/event/emperor','https://mobilesymphony.org/event/fanfare','https://www.hso.org/concerts/liszt-fantasy/','https://www.juneausymphony.org/apr2019/']

CodePudding user response:

The first thing you need to do is extract the base URL (the host part), which can be done using urllib.parse.urlparse.

You can then use groupby with sample (available on grouped objects since pandas 1.1) to pick one random URL for each base_url.

import urllib.parse
import pandas as pd


# initialise data of lists.
data = {'url':['https://alabamasymphony.org/event/shamrocks-strings', 
               'https://alabamasymphony.org/event/emperor', 
               'https://mobilesymphony.org/event/fanfare',
               'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
               'https://www.hso.org/concerts/liszt-fantasy/',
               'https://www.juneausymphony.org/apr2019/']}

# Create DataFrame
df = pd.DataFrame(data)

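# Extract the host (netloc) of each URL to use as the grouping key
df['base_url'] = df['url'].apply(lambda url: urllib.parse.urlparse(url).netloc)

# Pick one random row per host; named `sampled` so it does not shadow the stdlib `random` module
sampled = df.groupby('base_url').sample(n=1)

print(sampled)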

                                           url                base_url
1    https://alabamasymphony.org/event/emperor     alabamasymphony.org
2     https://mobilesymphony.org/event/fanfare      mobilesymphony.org
4  https://www.hso.org/concerts/liszt-fantasy/             www.hso.org
5      https://www.juneausymphony.org/apr2019/  www.juneausymphony.org
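
The question asks for a plain list rather than a DataFrame, so the same groupby and sample approach could be wrapped as in the sketch below. The helper name `pick_one_per_host` and its `seed` parameter are hypothetical, introduced only for illustration; `seed` is simply passed through to pandas' `random_state` so that a run can be repeated when needed.

```python
import urllib.parse
import pandas as pd


def pick_one_per_host(urls, seed=None):
    """Return one randomly chosen URL per host (netloc) as a plain list.

    `pick_one_per_host` and `seed` are illustrative names, not part of pandas.
    """
    df = pd.DataFrame({'url': list(urls)})
    # Group key: the host part of each URL, e.g. 'www.hso.org'
    df['base_url'] = df['url'].map(lambda u: urllib.parse.urlparse(u).netloc)
    # One random row per group; pass an int seed for repeatable picks
    picked = df.groupby('base_url').sample(n=1, random_state=seed)
    return picked['url'].tolist()


urls = [
    'https://alabamasymphony.org/event/shamrocks-strings',
    'https://alabamasymphony.org/event/emperor',
    'https://mobilesymphony.org/event/fanfare',
    'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
    'https://www.hso.org/concerts/liszt-fantasy/',
    'https://www.juneausymphony.org/apr2019/',
]

result = pick_one_per_host(urls)
print(result)  # four URLs, one per host; which ones are picked varies between runs
```

Note that grouping on netloc treats 'www.hso.org' and 'hso.org' as different hosts; if both forms occur in the CSV, the key would need to be normalised first.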