I'm using bs4 to scrape links from a scrolling marquee. I can get the marquee data, which is returned as a bs4 ResultSet. However, I can't seem to access the hrefs within the data. I'm sure I'm missing something, as I'm new to web scraping, and I'd appreciate any guidance.
Note: I can get the links easy peasy with selenium and chrome driver, but it takes forever.
This returns all of the marquee data:
url = 'https://drugs.globalincidentmap.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
marquee = soup.select('div', class_='h-48')
print(marquee)
However, when I try to drill down further into the data, I get an empty list or a NoneType/KeyError or AttributeError.
for a in marquee.find_all('a', href=True):
    link = a.find('div', class_=':nth-child')
or
for a in marquee.find_all('a', href=True):
    link = a.find('div', class_='flex p-2')
CodePudding user response:
"I can get the links easy peasy with selenium and chrome driver"
Probably because the div with the h-48 class is rendered by JavaScript, so it isn't in the HTML that requests fetches. On top of that, I don't think soup.select('div', class_='h-48') gives the exact results you expect it to: select isn't really supposed to have a class_ argument, just a CSS selector string. It also returns a ResultSet (a list of tags) rather than a single tag, so calling find_all on it directly raises AttributeError.
soup.find('div', attrs={'class':'h-48'}) or soup.select('div.h-48') can be expected to work on the html that is formed after JS loading (a class filter matches any one of an element's classes, so it doesn't matter that the element has more of them), but you need selenium to get that...
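To make the difference between the two calling styles concrete, here is a minimal sketch on a small hand-written HTML snippet. The markup is an assumption standing in for the JS-rendered page (the real structure may differ), and the ids in the links are made up. Note that a class filter in bs4 matches against any single class on an element.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the rendered markup; not copied from the site.
html = """
<div class="h-48 overflow-hidden">
  <a href="/event_detail?id=1"><div class="flex p-2">first</div></a>
  <a href="/event_detail?id=2"><div class="flex p-2">second</div></a>
</div>
<div class="other"><a href="/about">about</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find: class_ matches if 'h-48' is any one of the element's classes
box = soup.find('div', class_='h-48')

# select: takes a CSS selector string and returns a list-like ResultSet
boxes = soup.select('div.h-48')

# A single selector can go straight to the anchors inside the container
hrefs = [a['href'] for a in soup.select('div.h-48 a[href]')]
print(hrefs)
```

Because select returns a ResultSet, you either loop over it or use select_one when you want a single tag to call find_all on.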
Fortunately, I think the data you want is already in the fetched html, just in a different format - you can extract a list of dictionaries (mqCont) with
import json

marq = soup.find('marquee', attrs={'class': 'h-48'})
mqCont = []  # stays empty if any of the checks below fail
if marq is None:
    print('Could Not Find marquee.h-48')
elif not marq.get(':contents'):
    print('marquee.h-48 has no [:contents] attr')
else:
    try:
        mqCont = json.loads(marq.get(':contents'))
    except Exception as e:
        print('failed to parse marquee.h-48[:contents] <---', e)
or, more concisely (if you're confident there won't be any errors to debug or break down):
mqCont = json.loads(soup.select_one('marquee.h-48').get(':contents', '[]'))
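If you want the one-liner's brevity but still need a safe fallback, the parsing step could be wrapped in a small helper. This is only a sketch: parse_contents is a made-up name, and the sample strings are invented rather than taken from the site.

```python
import json

def parse_contents(raw):
    """Decode a :contents attribute value; return [] on any problem."""
    if not raw:  # attribute missing (None) or empty string
        return []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        print('failed to parse [:contents] <---', e)
        return []
    # Only a JSON array is useful for the steps that follow
    return data if isinstance(data, list) else []

sample = '[{"id": 1, "url": "https://example.com/a"}]'  # invented sample value
print(parse_contents(sample))
print(parse_contents(None))
print(parse_contents('not json'))
```

With the soup from above, it would be called as mqCont = parse_contents(marq.get(':contents') if marq else None).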
You can get a list of links to news articles with [m['url'] for m in mqCont if 'url' in m], but since you were trying to use find with class_='flex p-2', you probably want the .../event_detail?id=... links. You can form them as below:
evtUrls = [f"{url.strip('/')}/event_detail?id={m['id']}" for m in mqCont if 'id' in m]
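As a quick check of both comprehensions, here is a sketch run against a few invented entries. The ids and example.com links are placeholders that only mimic the shape of the real dictionaries, and rstrip is used instead of strip so that only the trailing slash is trimmed.

```python
url = 'https://drugs.globalincidentmap.com/'

# Invented entries; real ones carry many more keys
mqCont = [
    {'id': 101, 'url': 'https://example.com/story-1'},
    {'id': 102},                              # no article link
    {'url': 'https://example.com/story-3'},   # no id
]

# Links to the news articles, skipping entries without a 'url' key
newsUrls = [m['url'] for m in mqCont if 'url' in m]

# Links to the site's own event pages, skipping entries without an 'id' key
evtUrls = [f"{url.rstrip('/')}/event_detail?id={m['id']}" for m in mqCont if 'id' in m]

print(newsUrls)
print(evtUrls)
```

The `if 'url' in m` and `if 'id' in m` filters are what keep a missing key from raising KeyError.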
You can also view the list of dictionaries as a table [with pandas] by doing something like:
import pandas

omitKeys = ['domain_event_types', 'country']
for i, m in enumerate(mqCont):
    # strip HTML tags from the description and collapse whitespace
    mDesc = ' '.join(BeautifulSoup(
        m.get('description') or '', 'html.parser'
    ).get_text().split())
    if mDesc: m['description'] = mDesc
    if 'id' in m: m['eventUrl'] = f"{url.strip('/')}/event_detail?id={m['id']}"
    # drop the keys listed in omitKeys
    mqCont[i] = {k: v for k, v in m.items() if k not in omitKeys}

mqcDF = pandas.DataFrame(mqCont).dropna(axis='columns', how='all').set_index('id')
and the first 5 rows [of 100 rows total] of mqcDF:
id | country_id | address | event_gmt_time | severity | infrastructure | tip_text | url | description | latitude | longitude | created_user_id | location_granularity_id | is_approved | created_at | updated_at | eventUrl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11919404 | 231 | Pennsylvania, USA | 2022-12-01 18:36:53 | Severe | Unknown | PENNSYLVANIA - Photos - Suspects - Evidence In Multi-County Drug Bust | https://www.wfmz.com/news/area/berks/photos-suspects-evidence-in-multi-county-drug-bust/collection_bf795c98-71ad-11ed-99fe-4305f426699b.html#1 | [69 NEWS] PENNSYLVANIA - PHOTOS: Suspects, evidence in multi-county drug bust "Authorities said they seized evidence that included 27.5 kilograms of cocaine with a potential street value of $2.7 million and 5.5 kilograms of fentanyl with a potential street value of $1.6 million." Read full article at: https://www.wfmz.com/news/area/berks/photos-suspects-evidence-in-multi-county-drug-bust/collection_bf795c98-71ad-11ed-99fe-4305f426699b.html#1 | 41.2033 | -77.1945 | 14 | 8 | 1 | 2022-12-02T18:44:43.000000Z | 2022-12-02T18:44:43.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919404 |
11919401 | 40 | Vancouver Island, British Columbia, Canada | 2022-12-01 18:33:01 | Severe | Unknown | CANADA - Drugs - Guns Seized As 4 BC Men With Hells Angels Ties Face Serious Charges | https://www.terracestandard.com/news/alleged-drug-traffickers-on-vancouver-island-with-hells-angels-ties-face-serious-charges/ | [terracestandard.com] CANADA - Drugs, guns seized as 4 B.C. men with Hells Angels ties face ‘serious charges’ "CFSEU said the seized drugs included 7.75kg of cocaine, 4kg of cannabis, 1.9kg of methamphetamine, 248 oxycodone pills, and more." Read full article at: https://www.terracestandard.com/news/alleged-drug-traffickers-on-vancouver-island-with-hells-angels-ties-face-serious-charges/ | 49.6506 | -125.449 | 14 | 5 | 1 | 2022-12-02T18:36:37.000000Z | 2022-12-02T18:36:37.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919401 |
11919397 | 133 | Male, Maldives | 2022-11-20 18:29:26 | Severe | Unknown | MALDIVES - Drugs Worth Mvr 2 Mln Seized By Customs | https://avas.mv/en/125385 | [avas.mv] MALDIVES - Drugs worth MVR 2 mln seized by Customs "Maldives Customs Service has seized 1.34 kg of drugs smuggled into the Maldives via courier." Read full article at: https://avas.mv/en/125385 | 4.1755 | 73.5093 | 14 | 5 | 1 | 2022-12-02T18:32:45.000000Z | 2022-12-02T18:32:45.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919397 |
11919394 | 231 | 100 South Willow Avenue, Compton, CA, USA | 2022-11-29 18:23:50 | Severe | Unknown | CALIFORNIA - USD4 Million Worth Of Illegal Drugs Seized In Compton | https://www.foxla.com/news/4-million-worth-of-illegal-drugs-seized-in-compton | [foxla] CALIFORNIA - $4 million worth of illegal drugs seized in Compton "A search warrant at the home resulted in the seizure of about 5.5 lbs. of suspected tar heroin, 10 kilos of suspected powder cocaine, 6 kilos of suspected powder fentanyl, 6,000 suspected ecstasy pills containing fentanyl, and 254,000 suspected fentanyl pills all worth a combined estimated street value of $4.17 million, authorities said. " Read full article at: https://www.foxla.com/news/4-million-worth-of-illegal-drugs-seized-in-compton | 33.896 | -118.218 | 14 | 5 | 1 | 2022-12-02T18:29:25.000000Z | 2022-12-02T18:29:25.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919394 |
11919392 | 166 | Gwadar, Pakistan | 2022-12-01 18:22:00 | Severe | Unknown | PAKISTAN - Convoy Of Camels Loaded With Drugs Seized | https://pakobserver.net/convoy-of-camels-loaded-with-drugs-seized/ | [pakobserver.net] PAKISTAN - Convoy Of Camels Loaded With Drugs Seized "While searching the goods carried by the camels, ANF officials found them to be full of drugs (hashish). The drugs weighed around 1.4 tons." Read full article at: https://pakobserver.net/convoy-of-camels-loaded-with-drugs-seized/ | 25.1313 | 62.325 | 14 | 5 | 1 | 2022-12-02T18:23:49.000000Z | 2022-12-02T18:23:49.000000Z | https://drugs.globalincidentmap.com/event_detail?id=11919392 |
Markdown for the above table was printed with print(mqcDF.loc[mqcDF.index[:5]].to_markdown())