I have a CSV file containing thousands of URLs. How can I randomly select one URL for each base URL (domain)? The order of the results does not matter, but the selection has to be random.
import pandas as pd
# initialise data of lists.
data = {'url':['https://alabamasymphony.org/event/shamrocks-strings',
'https://alabamasymphony.org/event/emperor',
'https://mobilesymphony.org/event/fanfare',
'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
'https://www.hso.org/concerts/liszt-fantasy/',
'https://www.juneausymphony.org/apr2019/']}
# Create DataFrame
df = pd.DataFrame(data)
df
Expected output
['https://alabamasymphony.org/event/emperor','https://mobilesymphony.org/event/fanfare','https://www.hso.org/concerts/liszt-fantasy/','https://www.juneausymphony.org/apr2019/']
CodePudding user response:
The first thing you need to do is extract the base URL, which can be done using urllib.parse. You can then use groupby with sample to pick one random URL for each base_url.
import urllib.parse
import pandas as pd
# initialise data of lists.
data = {'url':['https://alabamasymphony.org/event/shamrocks-strings',
'https://alabamasymphony.org/event/emperor',
'https://mobilesymphony.org/event/fanfare',
'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
'https://www.hso.org/concerts/liszt-fantasy/',
'https://www.juneausymphony.org/apr2019/']}
# Create DataFrame
df = pd.DataFrame(data)
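# Group key: the host part (netloc) of each URL
df['base_url'] = df['url'].apply(lambda url: urllib.parse.urlparse(url).netloc)
# Pick one random row per host (groupby(...).sample needs pandas >= 1.1)
sampled = df.groupby('base_url').sample(n=1)
print(sampled)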
   url                                          base_url
1  https://alabamasymphony.org/event/emperor    alabamasymphony.org
2  https://mobilesymphony.org/event/fanfare     mobilesymphony.org
4  https://www.hso.org/concerts/liszt-fantasy/  www.hso.org
5  https://www.juneausymphony.org/apr2019/      www.juneausymphony.org
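The question asks for a plain list rather than a DataFrame, so here is a minimal sketch that wraps the same groupby/sample idea in a function and returns a list. The function name sample_one_per_host and its seed parameter are made up for illustration; seed is simply passed to random_state so a run can be repeated while testing, and leaving it as None gives a different pick each time.

```python
from urllib.parse import urlparse

import pandas as pd


def sample_one_per_host(urls, seed=None):
    """Return one randomly chosen URL per host (netloc) as a plain list.

    Hypothetical helper, not part of pandas; `seed` is forwarded to
    `random_state` so that the selection can be reproduced.
    """
    df = pd.DataFrame({"url": list(urls)})
    # Group key: the host part of each URL, e.g. "www.hso.org"
    df["base_url"] = df["url"].map(lambda u: urlparse(u).netloc)
    # One random row per host
    picked = df.groupby("base_url").sample(n=1, random_state=seed)
    return picked["url"].tolist()


urls = [
    "https://alabamasymphony.org/event/shamrocks-strings",
    "https://alabamasymphony.org/event/emperor",
    "https://mobilesymphony.org/event/fanfare",
    "https://mobilesymphony.org/event/the-fireworks-of-jupiter/",
    "https://www.hso.org/concerts/liszt-fantasy/",
    "https://www.juneausymphony.org/apr2019/",
]

result = sample_one_per_host(urls, seed=0)
print(result)  # a list with one URL for each of the four hosts
```

For the real data, the list could come from something like pd.read_csv(...)["url"], assuming the CSV has a column named url. Note that netloc treats "www.hso.org" and "hso.org" as different hosts, so if both forms occur in the file they would need to be normalised before grouping.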