For help

Time:04-14

QUESTION

Before you get started, explore the website https://www.abc.net.au/news/justin

We are interested in the titles of, and hyperlinks to, the news items in the "Just In" section. More specifically, that is the section highlighted in the following image:

The content on the page has obviously been updated since this exercise was set up. Still, while the exact content has changed, the principle stays the same.

What you need to do is:

Scrape (1) the titles, (2) the hyperlinks, and (3) the descriptions of the news items in the highlighted section of the page. In the historic example in the screenshot, the titles are 'Serious grounds to be concerned about... ', 'Tourism sandbox: Phuket... ', etc.; the URLs are the pages you are directed to when you click those titles; and the descriptions are 'Allies of jailed... ', 'Thailand's resort mixes island... ', etc., that belong to those titles.
Save the information into a CSV file named 'abcnews.csv' that contains three variables: 'title', 'url', and 'description'. One row for each article, combining the title, the hyperlink, and the description of that article.

CodePudding user response:

from urllib.request import Request, urlopen
import ssl
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.abc.net.au/news/justin'

#################################################################
# Fetch the page with a browser-like User-Agent and an unverified
# SSL context so urlopen does not reject the certificate.
#################################################################

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
req = Request(url, headers=headers)
context = ssl._create_unverified_context()

uClient = urlopen(req, context=context)
html = uClient.read()
uClient.close()

#################################################################
# Parse the page and collect the title and href of every link in
# the "Just In" container, then write the results to a CSV file.
#################################################################

soup = BeautifulSoup(html, 'html.parser')
maindiv = soup.find('div', class_="JustInPaginationList")

dataset = []

for item in maindiv('a'):
    title = item.find("p").getText()
    url = item['href']
    print(title)
    print(url)
    print()

    dataset.append({'title': title, 'url': url})

dataset = pd.DataFrame(dataset)
dataset.to_csv('abcnews.csv', sep=';', index=False)

As it stands, line 28 now displays a "can't call" error. How should this be written to meet the requirements above?
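
CodePudding user response:

The "can't call" message at line 28 is almost certainly a TypeError: 'NoneType' object is not callable. soup.find('div', class_="JustInPaginationList") comes back as None, presumably because the page no longer uses that class name, and the loop then tries to call None('a'). Below is a minimal sketch of a more defensive version that also grabs the descriptions and writes the three-column CSV the exercise asks for. The selector names in capitals are placeholders, not the site's real markup: open the "Just In" list in your browser's inspector and substitute whatever container, title, and description tags the page actually uses.

from urllib.parse import urljoin
from urllib.request import Request, urlopen
import ssl

from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://www.abc.net.au/news/justin'
HEADERS = {'User-Agent': 'Mozilla/5.0'}

# Placeholder selectors - inspect the live page and replace them with
# whatever the "Just In" list actually uses today.
CONTAINER_SELECTOR = 'div.JustInPaginationList'    # hypothetical, likely outdated
ARTICLE_SELECTOR = 'article'                       # hypothetical
TITLE_SELECTOR = 'h3'                              # hypothetical
DESCRIPTION_SELECTOR = 'p'                         # hypothetical

req = Request(URL, headers=HEADERS)
context = ssl._create_unverified_context()
with urlopen(req, context=context) as resp:
    html = resp.read()

soup = BeautifulSoup(html, 'html.parser')

container = soup.select_one(CONTAINER_SELECTOR)
if container is None:
    # This is exactly what breaks the original script: find() returns None,
    # so None('a') raises "'NoneType' object is not callable".
    raise SystemExit('Container not found - update CONTAINER_SELECTOR '
                     'after inspecting the page.')

rows = []
for article in container.select(ARTICLE_SELECTOR):
    link = article.find('a', href=True)
    title_tag = article.select_one(TITLE_SELECTOR)
    desc_tag = article.select_one(DESCRIPTION_SELECTOR)
    if link is None or title_tag is None:
        continue  # not a complete news item, skip it
    rows.append({
        'title': title_tag.get_text(strip=True),
        # urljoin handles both relative and absolute hrefs
        'url': urljoin(URL, link['href']),
        'description': desc_tag.get_text(strip=True) if desc_tag else '',
    })

pd.DataFrame(rows, columns=['title', 'url', 'description']).to_csv(
    'abcnews.csv', index=False)

One caveat: if the list is rendered by JavaScript rather than present in the HTML that urlopen downloads, no selector will help; in that case a browser-automation tool such as Selenium (or an RSS feed, if the site offers one) would be the way to get the items.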
