Web Scrape a specific tag using python BeautifulSoup

Time:08-29

I am working on a personal project in which I am trying to analyze the incidents caused by the unethical use of AI systems. To do that, I am trying to scrape this website.

CodePudding user response:

from selenium.webdriver.common.by import By
urls = [x.get_attribute('href') for x in driver.find_elements(By.XPATH, "//div[@class='h-100 card']/a")]

If you want the hrefs of the 28 or so card elements, you can grab them like this. If the page is slow to load, add a WebDriverWait before collecting them.
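A minimal sketch of what that wait could look like, assuming an already-open Selenium `driver`. The helper name `collect_card_links`, the constant `CARD_LINK_XPATH`, and the 10-second default timeout are hypothetical choices for illustration; the XPath is the one used above.

```python
# XPath taken from the answer above; the constant name is hypothetical.
CARD_LINK_XPATH = "//div[@class='h-100 card']/a"


def collect_card_links(driver, timeout=10):
    """Wait until the card links are present, then return their href values.

    `driver` is assumed to be an already-open Selenium WebDriver.
    The timeout (in seconds) is an arbitrary default, not a recommendation.
    """
    # Imported here so the sketch can be loaded even where Selenium is missing.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and raises TimeoutException if it never holds.
    links = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, CARD_LINK_XPATH))
    )
    return [link.get_attribute("href") for link in links]
```

Calling `urls = collect_card_links(driver)` would then replace the one-liner. Note that this only waits for the first batch of cards to be present; it does not scroll the page.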

CodePudding user response:

This is a very interesting question because it is, by its nature, an X-Y problem. Selenium is not the right tool for this job, IMHO. The page is (very) dynamic: besides being hydrated from external APIs, it also tracks user interaction and loads data as you scroll. It is of course possible to do this with Selenium, but there is a better way. There are 311 incidents, all of them extensively documented. The way forward is to request the API endpoint for each one: the result is a large, very detailed JSON object. For example, to scrape the first 19 incidents using requests and pandas:

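import requests
import pandas as pd
from tqdm import tqdm

frames = []
# range(1, 20) covers incident ids 1 through 19
for counter in tqdm(range(1, 20)):
    r = requests.get(f'https://incidentdatabase.ai/page-data/cite/{counter}/page-data.json')
    # Flatten this incident's list of report objects into one row per report
    frames.append(pd.json_normalize(r.json()['result']['pageContext']['incidentReports']))
# Concatenate once at the end instead of growing a DataFrame inside the loop
big_df = pd.concat(frames, axis=0, ignore_index=True)
print(big_df)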

This will result in:

19/19 [01:00<00:00, 3.25s/it]
submitters  date_published  report_number   title   url image_url   cloudinary_id   source_domain   mongodb_id  text    authors epoch_date_submitted    language
0   [Roman Yampolskiy]  2015-05-19  1   Google’s YouTube Kids App Criticized for ‘Inappropriate Content’    https://blogs.wsj.com/digits/2015/05/19/googles-youtube-kids-app-criticized-for-inappropriate-content/  http://si.wsj.net/public/resources/images/BN-IM269_YouTub_P_20150518174822.jpg  reports/si.wsj.net/public/resources/images/BN-IM269_YouTub_P_20150518174822.jpg blogs.wsj.com   5d34b8c29ced494f010ed45a    Child and consumer advocacy groups complained to the Federal Trade Commission Tuesday that Google’s new YouTube Kids app contains “inappropriate content,” including explicit sexual language and jokes about pedophilia.\n\nGoogle launched the app for young children in February, saying the available videos were “narrowed down to content appropriate for kids.”  [Alistair Barr] 1559347200  en
1   [Roman Yampolskiy]  2018-02-07  2   YouTube Kids app is STILL showing disturbing videos https://www.dailymail.co.uk/sciencetech/article-5358365/YouTube-Kids-app-showing-disturbing-videos.html https://i.dailymail.co.uk/i/pix/2018/02/06/15/48EEE02F00000578-0-image-a-18_1517931140185.jpg   reports/i.dailymail.co.uk/i/pix/2018/02/06/15/48EEE02F00000578-0-image-a-18_1517931140185.jpg   dailymail.co.uk 5d34b8c29ced494f010ed45b    Google-owned YouTube has apologised again after more disturbing videos surfaced on its YouTube Kids app.\n\nInvestigators found several unsuitable videos including one of a burning aeroplane from the cartoon Paw Patrol and footage explaining how to sharpen a knife.\n\nYouTube has been criticised for using algorithms to sieve through material rather than using human moderators to judge what might be appropriate.\n\nThere have been hundreds of disturbing videos found on YouTube Kids in recent months that are easily accessed by children.\n\nThese videos have featured horrible things happening to various characters, including ones from the Disney movie Frozen, the Minions franchise, Doc McStuffins and Thomas the Tank Engine.\n\nParents, regulators, advertisers and law enforcement have become increasingly concerned about the open nature of the service.\n\nScroll down for video\n\nYouTube has apologised again after more disturbing videos surfaced on its YouTube Kids app. 
Investigators found several unsuitable videos including one from the cartoon Paw Patrol on a burning aeroplane and footage showing how to sharpen a knife\n\nA YouTube spokesperson has admitted the company needs to 'do more' to tackle inappropriate videos on their kids platform.\n\nThis investigation is the latest to expose inappropriate content on the video-sharing site which has been subject to a slew of controversies since its creation in 2005.\n\nAs part of an in-depth investigation by BBC Newsround, Google's Public Policy Manager Katie O'Donovan met five children who told her about the distressing videos they had seen on the site.\n\nThey included videos showing clowns covered in blood and messages warning them there was someone at the door.\n\nMs O'Donovan said she was 'very, very sorry for any hurt or discomfort'.\n\n'We've actually built a whole new platform for kids, called YouTube Kids, where we take the best content, stuff that children are most interested in and put it on there in a packaged up place just for kids,' she said.\n\nIt normally takes five days for supposedly child-friendly content like cartoons to get from YouTube to YouTube Kids.\n\nWithin that window it is hoped users and a specially-trained team will flag disturbing content.\n\nOnce it has been flagged and reviewed, it won't appear on the YouTube Kids app and only people who are signed in and older than 18 years old will be able to view it.\n\nThe company say thousands of people will be working around the clock to flag content.\n\nHowever, as part of the investigation Newsround revealed there are still lots of inappropriate videos on the Kids section.\n\n'We have seen significant investment in building the right tools so people can flag that [content], and those flags are reviewed very, very quickly', Ms O'Donovan said.\n\n'We're also beginning to use machine learning to identify the most harmful content, which is then automatically reviewed.'\n\nThe problem was managing an open platform 
where content is uploaded straight onto the site, she added.\n\n'It is a difficult environment because things are moving so, so quickly', said Ms O'Donovan.\n\n'We have a responsibility to make sure the platform can survive and can thrive so that we have a collection that comes from around the world on there'.\n\nBy the end of last year YouTube said it had removed more than 50 user channels and had stopped running ads on more than 3.5 million videos since June.\n\n'Content that endangers children is unacceptable to us and we have clear policies against such videos on YouTube and YouTube Kids', a YouTube spokesperson told MailOnline.\n\n'When we discover any inappropriate content, we quickly take action to remove it from our platform.\n\n'Over the past few months, we've taken a series of steps to tackle many of the emerging challenges around family content on YouTube, including: tightening enforcement of our Community Guidelines, age-gating content that inappropriately targets families, and removing it from the YouTube Kids app.'\n\nYouTube has been criticised for using algorithms to sieve through material rather than using human moderators to judge what might be appropriate (stock image)\n\nIn March, a disturbing Peppa Pig fake, found by journalist Laura June, shows a dentist with a huge syringe pulling out the character's teeth as she screams in distress.\n\nMrs June only realised the violent nature of the video as her three-year-old daughter watched it beside her.\n\n'Peppa does a lot of screaming and crying and the dentist is just a bit sadistic and it's just way, way off what a three-year-old should watch,' she said.\n\n'But the animation is close enough to looking like Peppa - it's crude but it's close enough that my daughter was like 'This is Peppa Pig.''\n\nAnother video depicted Peppa Pig and a friend deliberately burning down a house with someone in it.\n\nAll of these videos are easily accessed by children through YouTube's search results or recommended 
videos.\n\nIn March, a disturbing Peppa Pig fake, found by journalist Laura June, shows a dentist with a huge syringe pulling ou   [Phoebe Weston] 1559347200
[...]

The JSON responses can be dissected and analysed further, and more useful information can be pulled from them (including the Euclidean distance between incidents, among a lot of other things).
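As a rough illustration of dissecting one response by hand, here is a sketch that uses only the standard library. The helper names `extract_reports` and `summarize` are hypothetical, and `sample_page_data` is a hand-trimmed stand-in for a real response: it keeps only the nesting (`result` > `pageContext` > `incidentReports`) and a few of the columns shown in the output above.

```python
def extract_reports(page_data):
    """Return the list of report dicts from one page-data.json payload."""
    return (
        page_data.get("result", {})
        .get("pageContext", {})
        .get("incidentReports", [])
    )


def summarize(reports):
    """Keep a few fields per report and sort by publication date."""
    rows = [
        {
            "report_number": report.get("report_number"),
            "date_published": report.get("date_published"),
            "source_domain": report.get("source_domain"),
        }
        for report in reports
    ]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(rows, key=lambda row: row["date_published"] or "")


# Hand-trimmed sample, NOT a full response from the site.
sample_page_data = {
    "result": {
        "pageContext": {
            "incidentReports": [
                {"report_number": 2, "date_published": "2018-02-07",
                 "source_domain": "dailymail.co.uk"},
                {"report_number": 1, "date_published": "2015-05-19",
                 "source_domain": "blogs.wsj.com"},
            ]
        }
    }
}

for row in summarize(extract_reports(sample_page_data)):
    print(row)
```

With a real response, `r.json()` would take the place of `sample_page_data`. Using `.get()` with defaults means a payload that is missing one of the nested keys yields an empty list rather than a `KeyError`.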

Requests docs: https://requests.readthedocs.io/en/latest/

Pandas docs: https://pandas.pydata.org/pandas-docs/stable/index.html

And for tqdm: https://tqdm.github.io/
