I was able to scrape the title, date, links, and content of news on these links: https://www.news24.com/news24/southafrica/crime-and-courts
and https://www.news24.com/news24/southafrica/education
. The output is saved in an excel file. However, I noticed that not all the contents inside the articles were scrapped. I have tried different methods on my "Getting content section of my code" Any help with this will be appreciate. Below is my code:
import sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from datetime import timedelta
art_title = [] # to store the titles of all news article
art_date = [] # to store the dates of all news article
art_link = [] # to store the links of all news article
pagesToGet = ['southafrica/crime-and-courts', 'southafrica/education']
for i in range(0, len(pagesToGet)):
print('processing page : \n')
url = 'https://www.news24.com/news24/' str(pagesToGet[i])
print(url)
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
try:
driver.get("https://www.news24.com/news24/" str(pagesToGet[i]))
except Exception as e:
error_type, error_obj, error_info = sys.exc_info()
print('ERROR FOR LINK:', url)
print(error_type, 'Line:', error_info.tb_lineno)
continue
time.sleep(3)
scroll_pause_time = 1
screen_height = driver.execute_script("return window.screen.height;")
i = 1
while True:
driver.execute_script("window.scrollTo(0, {screen_height}{i});".format(screen_height=screen_height, i=i))
i = 1
time.sleep(scroll_pause_time)
scroll_height = driver.execute_script("return document.body.scrollHeight;")
if (screen_height) * i > scroll_height:
break
soup = BeautifulSoup(driver.page_source, 'html.parser')
news = soup.find_all('article', attrs={'class': 'article-item'})
print(len(news))
# Getting titles, dates, and links
for j in news:
titles = j.select_one('.article-item__title span')
title = titles.text.strip()
dates = j.find('p', attrs={'class': 'article-item__date'})
date = dates.text.strip()
address = j.find('a').get('href')
news_link = 'https://www.news24.com' address
art_title.append(title)
art_date.append(date)
art_link.append(news_link)
df = pd.DataFrame({'Article_Title': art_title, 'Date': art_date, 'Source': art_link})
# Getting Content Section
news_articles = [] # to store the content of each news artcle
news_count = 0
for link in df['Source']:
print('\n')
start_time = time.monotonic()
print('Article No. ', news_count)
print('Link: ', link)
# Countermeasure for broken links
try:
if requests.get(link):
news_response = requests.get(link)
else:
print("")
except requests.exceptions.ConnectionError:
news_response = 'N/A'
# Auto sleep trigger after saving every 300 articles
sleep_time = ['100', '200', '300', '400', '500']
if news_count in sleep_time:
time.sleep(12)
else:
""
try:
if news_response.text:
news_data = news_response.text
else:
print('')
except AttributeError:
news_data = 'N/A'
news_soup = BeautifulSoup(news_data, 'html.parser')
try:
if news_soup.find('div', {'class': 'article__body'}):
art_cont = news_soup.find('div','article__body')
art = []
article_text = [i.text.strip().replace("\xa0", " ") for i in art_cont.findAll('p')]
art.append(article_text)
else:
print('')
except AttributeError:
article = 'N/A'
print('\n')
news_count = 1
news_articles.append(art)
end_time = time.monotonic()
print(timedelta(seconds=end_time - start_time))
print('\n')
# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset ="Source", keep = False, inplace = True)
# Dont store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_excel('SA_news24_3.xlsx')
driver.quit()
I tried the following code in the Getting Content Section as well. However, it produced the same output.
article_text = [i.get_text(strip=True).replace("\xa0", " ") for i in art_cont.findAll('p')]
CodePudding user response:
The site has various types of URLs so your code was omitting them since they found it malformed or some had to be subscribed to read.For the ones that has to be subscribed to read i have added "Login to read" followers by the link in articles . I ran this code till article number 670 and it didn't give any error. I had to change it from .xlsx to .csv since it was giving an error of openpyxl in python 3.11.0.
Full Code
import time
import sys
from datetime import timedelta
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
art_title = [] # to store the titles of all news article
art_date = [] # to store the dates of all news article
art_link = [] # to store the links of all news article
pagesToGet = ['southafrica/crime-and-courts',
'southafrica/education', 'places/gauteng']
for i in range(0, len(pagesToGet)):
print('processing page : \n')
if "places" in pagesToGet[I]:
url = f"https://news24.com/api/article/loadmore/tag?tagType=places&tag={pagesToGet[i].split('/')[1]}&pagenumber=1&pagesize=100&ishomepage=false&ismobile=false"
else:
url = f"https://news24.com/api/article/loadmore/news24/{pagesToGet[i]}?pagenumber=1&pagesize=1200&ishomepage=false&ismobile=false"
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.json()["htmlContent"], 'html.parser')
news = soup.find_all('article', attrs={'class': 'article-item'})
print(len(news))
# Getting titles, dates, and links
for j in news:
titles = j.select_one('.article-item__title span')
title = titles.text.strip()
dates = j.find('p', attrs={'class': 'article-item__date'})
date = dates.text.strip()
address = j.find('a').get('href')
# Countermeasure for links with full url
if "https://" in address:
news_link = address
else:
news_link = 'https://www.news24.com' address
art_title.append(title)
art_date.append(date)
art_link.append(news_link)
df = pd.DataFrame({'Article_Title': art_title,
'Date': art_date, 'Source': art_link})
# Getting Content Section
news_articles = [] # to store the content of each news artcle
news_count = 0
for link in df['Source']:
start_time = time.monotonic()
print('Article No. ', news_count)
print('Link: ', link)
news_response = requests.get(link)
news_data = news_response.content
news_soup = BeautifulSoup(news_data, 'html.parser')
art_cont = news_soup.find('div', 'article__body')
# Countermeasure for links with subscribe form
try:
try:
article = art_cont.text.split("Newsletter")[
0] art_cont.text.split("Sign up")[1]
except:
article = art_cont.text
article = " ".join((article).strip().split())
except:
article = f"Login to read {link}"
news_count = 1
news_articles.append(article)
end_time = time.monotonic()
print(timedelta(seconds=end_time - start_time))
print('\n')
# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset="Source", keep=False, inplace=True)
# Dont store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_csv('SA_news24_3.csv')
Output
,Article_Title,Date,News
0,Pastor gets double life sentence plus two 15-year terms for rape and murder of women,2h ago,"A pastor has been sentenced to two life sentences and two 15-year jail terms for rape and murder His modus operandi was to take the women to secluded areas before raping them and tying them to trees.One woman managed to escape after she was raped but the bodies of two others were found tied to the trees.The North West High Court has sentenced a 50-year-old pastor to two life terms behind bars and two 15-year jail terms for rape and murder.Lucas Chauke, 50, was sentenced on Monday for the crimes, which were committed in 2017 and 2018 in Temba in the North West.According to North West National Prosecuting Authority spokesperson Henry Mamothame, Chauke's first victim was a 53-year-old woman.He said Chauke pretended that he would assist the woman with her spirituality and took her to a secluded place near to a dam.""Upon arrival, he repeatedly raped her and subsequently tied her to a tree before fleeing the scene,"" Mamothame said. The woman managed to untie herself and ran to seek help. She reported the incident to the police, who then started searching for Chauke.READ | Kidnappings doubled nationally: over 4 000 cases reported to police from July to SeptemberOn 10 May the following year, Chauke pounced on his second victim - a 55-year-old woman.He took her to the same secluded area next to the dam, raped her and tied her to a tree before fleeing. This time his victim was unable to free herself.Her decomposed body was later found, still tied to the tree, Mamothame said. His third victim was targeted months later, on 3 August, in the same area. According to Mamothame, Chauke attempted to rape her but failed.""He then tied her to a tree and left her to die,"" he said. Chauke was charged in connection with her murder.He was linked to the crimes via DNA.READ | 'These are not pets': Man gives away his two pit bulls after news of child mauled to deathIn aggravation of his sentence, State advocate Benny Kalakgosi urged the court not to deviate from the prescribed minimum sentences, saying that the offences Chauke had committed were serious.""He further argued that Chauke took advantage of unsuspecting women, who trusted him as a pastor but instead, [he] took advantage of their vulnerability,"" Mamothame said. Judge Frances Snyman agreed with the State and described Chauke's actions as horrific.The judge also alluded to the position of trust that he abused, Mamothame said.Chauke was sentenced to life for the rape of the first victim, 15 years for the rape of the second victim and life for her murder, as well as 15 years for her murder. He was also declared unfit to possess a firearm."
1,'I am innocent': Alleged July unrest instigator Ngizwe Mchunu pleads not guilty,4h ago,"Former Ukhozi FM DJ Ngizwe Mchunu has denied inciting the July 2022 unrest.Mchunu pleaded not guilty to charges that stem from the incitement allegations.He also claimed he had permission to travel to Gauteng for work during the Covid-19 lockdown.""They are lying. I know nothing about those charges,"" alleged July unrest instigator, former Ukhozi FM DJ Ngizwe Brian Mchunu, told the Randburg Magistrate's Court when his trial started on Tuesday.Mchunu pleaded not guilty to the charges against him, which stems from allegations that he incited public violence, leading to the destruction of property, and convened a gathering in contravening of Covid-19 lockdown regulations after the detention of former president Jacob Zuma in July last year.In his plea statement, Mchunu said all charges were explained to him.""I am a radio and television personality. I'm also a poet and cultural activist. In 2020, I established my online radio.READ | July unrest instigators could face terrorism-related charges""On 11 July 2021, I sent invitations to journalists to discuss the then-current affairs. At the time, it was during the arrest of Zuma.""I held a media briefing at a hotel in Bryanston to show concerns over Zuma's arrest. Zuma is my neighbour [in Nkandla]. In my African culture, I regard him as my father.""Mchunu continued that he was not unhappy about Zuma's arrest but added: ""I didn't condone any violence. I pleaded with fellow Africans to stop destroying infrastructure. I didn't incite any violence.I said to them, 'My brothers and sisters, I'm begging you as we are destroying our country.'""He added:They are lying. I know nothing about those charges. I am innocent. He also claimed that he had permission to travel to Gauteng for work during the lockdown.The hearing continues."
2,Jukskei River baptism drownings: Pastor of informal 'church' goes to ground,5h ago,"A pastor of the church where congregants drowned during a baptism ceremony has gone to ground.Johannesburg Emergency Medical Services said his identity was not known.So far, 14 bodies have been retrieved from the river.According to Johannesburg Emergency Medical Services (EMS), 13 of the 14 bodies retrieved from the Jukskei River have been positively identified.The bodies are of congregants who were swept away during a baptism ceremony on Saturday evening in Alexandra.The search for the other missing bodies continues.Reports are that the pastor of the church survived the flash flood after congregants rescued him.READ | Jukskei River baptism: Families gather at mortuary to identify loved onesEMS spokesperson Robert Mulaudzi said they had been in contact with the pastor since the day of the tragedy, but that they had since lost contact with him.It is alleged that the pastor was not running a formal church, but rather used the Jukskei River as a place to perform rituals for people who came to him for consultations. At this stage, his identity is not known, and because his was not a formal church, Mulaudzi could not confirm the number of people who could have been attending the ceremony.Speaking to the media outside the Sandton fire station on Tuesday morning, a member of the rescue team, Xolile Khumalo, said: Thirteen out of the 14 bodies retrieved have been identified, and the one has not been identified yet.She said their team would continue with the search. ""Three families have since come forward to confirm missing persons, and while we cannot be certain that the exact number of bodies missing is three, we will continue with our search."""
3,Six-month-old infant ‘abducted’ in Somerset West CBD,9h ago,"Authorities are on high alert after a baby was allegedly abducted in Somerset West on Monday.The alleged incident occurred around lunchtime, but was only reported to Somerset West police around 22:00. According to Sergeant Suzan Jantjies, spokesperson for Somerset West police, the six-month-old baby boy was taken around 13:00. It is believed the infant’s mother, a 22-year-old from Nomzamo, entrusted a fellow community member and mother with the care of her child before leaving for work on Monday morning. However, when she returned home from work, she was informed that the child was taken. Police were apparently informed that the carer, the infant and her nine-year-old child had travelled to Somerset West CBD to attend to Sassa matters. She allegedly stopped by a liquor store in Victoria Street and asked an unknown woman to keep the baby and watch over her child. After purchasing what was needed and exiting the store, she realised the woman and the children were missing. A case of abduction was opened and is being investigated by the police’s Family Violence, Child Protection and Sexual Offences (FCS) unit. Police obtained security footage which shows the alleged abductor getting into a taxi and making off with the children. The older child was apparently dropped off close to her home and safely returned. However, the baby has still not been found. According to a spokesperson, FCS police members prioritised investigations immediately after the case was reported late last night and descended on the local township, where they made contact with the visibly “traumatised” parent and obtained statements until the early hours of Tuesday morning – all in hopes of locating the child and the alleged suspect.Authorities are searching for what is believed to be a foreign national woman with braids, speaking isiZulu.Anyone with information which could aid the investigation and search, is urged to call Captain Trevor Nash of FCS on 082 301 8910."
4,Have you herd: Dubai businessman didn't know Ramaphosa owned Phala Phala buffalo he bought - report,8h ago,"A Dubai businessman who bought buffaloes at Phala Phala farm reportedly claims he did not know the deal was with President Cyril Ramaphosa.Hazim Mustafa also claimed he was expecting to be refunded for the livestock after the animals were not delivered.He reportedly brought the cash into the country via OR Tambo International Airport, and claims he declared it.A Dubai businessman who reportedly bought 20 buffaloes from President Cyril Ramaphosa's Phala Phala farm claims that he didn't know the deal with was with the president, according to a report.Sky News reported that Hazim Mustafa, who reportedly paid $580 000 (R10 million) in cash for the 20 buffaloes from Ramaphosa's farm in December 2019, said initially he didn't know who the animals belonged to.A panel headed by former chief justice Sandile Ngcobo released a report last week after conducting a probe into allegations of a cover-up of a theft at the farm in February 2020.READ | Ramaphosa wins crucial NEC debate as parliamentary vote on Phala Phala report delayed by a weekThe panel found that there was a case for Ramaphosa to answer and that he may have violated the law and involved himself in a conflict between his official duties and his private business.In a statement to the panel, Mustafa was identified as the source of the more than $500 000 (R9 million) that was stolen from the farm. Among the evidence was a receipt for $580 000 that a Phala Phala employee had written to ""Mr Hazim"".According to Sky News, Mustafa said he celebrated Christmas and his wife's birthday in Limpopo in 2019, and that he dealt with a broker when he bought the animals.He reportedly said the animals were to be prepared for export, but they were never delivered due to the Covid-19 lockdown. He understood he would be refunded after the delays.He also reportedly brought the cash into the country through OR Tambo International Airport and said he declared it. Mustafa also told Sky News that the amount was ""nothing for a businessman like [him]"".READ | Here's the Sudanese millionaire - and his Gucci wife - who bought Ramaphosa's buffaloThe businessman is the owner Sudanese football club Al Merrikh SC. He is married to Bianca O'Donoghue, who hails from KwaZulu-Natal. O'Donoghue regularly takes to social media to post snaps of a life of wealth – including several pictures in designer labels and next to a purple Rolls Royce Cullinan, a luxury SUV worth approximately R5.5 million.Sudanese businessman Hazim Mustafa with his South African-born wife, Bianca O'Donoghue.Facebook PHOTO: Bianca O'Donoghue/Facebook News24 previously reported that he also had ties to former Sudanese president, Omar al-Bashir.There have been calls for Ramaphosa to step down following the saga. A motion of no confidence is expected to be submitted in Parliament.He denied any wrongdoing and said the ANC's national executive committee (NEC) would decide his fate.Do you have a tipoff or any information that could help shape this story? Email [email protected]"
5,Hefty prison sentence for man who killed stranded KZN cop while pretending to offer help,9h ago,"Two men have been sentenced – one for the murder of a KwaZulu-Natal police officer, and the other for an attempt to rob the officer.Sergeant Mzamiseni Mbele was murdered in Weenen in April last year.He was attacked and robbed when his car broke down on the highway while he was on his way home.A man who murdered a KwaZulu-Natal police officer, after pretending that he wanted to help him with his broken-down car, has been jailed.A second man, who was only convicted of an attempt to rob the officer, has also been sentenced to imprisonment.On Friday, the KwaZulu-Natal High Court in Madadeni sentenced Sboniso Linda, 36, to an effective 25 years' imprisonment, and Nkanyiso Mungwe, 25, to five years' imprisonment.READ | Alleged house robber shot after attack at off-duty cop's homeAccording to Hawks spokesperson, Captain Simphiwe Mhlongo, 39-year-old Sergeant Mzamiseni Mbele, who was stationed at the Msinga police station, was on his way home in April last year when his car broke down on the R74 highway in Weenen.Mbele let his wife know that the car had broken down. While stationary on the road, Linda and Mungwe approached him and offered to help.Mhlongo said: All of a sudden, [they] severely assaulted Mbele. They robbed him of his belongings and fled the scene. A farm worker found Mbele's body the next dayA case of murder was reported at the Weenen police station and the Hawks took over the investigation.The men were arrested.""Their bail [application] was successfully opposed and they appeared in court several times until they were found guilty,"" Mhlongo added.How safe is your neighbourhood? Find out by using News24's CrimeCheckLinda was sentenced to 20 years' imprisonment for murder and 10 years' imprisonment for robbery with aggravating circumstances. Half of the robbery sentence has to be served concurrently, leaving Linda with an effective sentence of 25 years.Mungwe was sentenced to five years' imprisonment for attempted robbery with aggravating circumstances."
Hope this helps. Happy coding :)