With the help of 'Life is complex' I have managed to scrape data from the CNN news website. The extracted URLs are saved in a .csv file (test1). Note that this was done manually, as it was easier!
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
import csv

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

with open('test1.csv', 'r') as file:
    csv_file = file.readlines()
    for url in csv_file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            print(article.title)
            article_text = article.text.replace('\n', ' ')
            print(article.text)
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)

with open('test2.csv', 'a', newline='') as csvfile:
    headers = ['article title', 'article text']
    writer = csv.DictWriter(csvfile, lineterminator='\n', fieldnames=headers)
    writer.writeheader()
    writer.writerow({'article title': article.title,
                     'article text': article.text})
With the code above I can scrape the actual news information (title and content) from the URLs and export it to a .csv file. The only issue is that the export contains just the last title and text, so I think it keeps overwriting the first row.
How can I get all the titles and content into the CSV file?
CodePudding user response:
Thanks for giving me a shout-out.
The code below should help you solve your CSV write issue. If it doesn't, just let me know and I will rework my answer.
P.S. I will update my Newspaper3k overview document with more details on writing CSV files.
P.P.S. I'm currently writing a new news scraper, because development of Newspaper3k is dead. I'm unsure of the release date of my code.
import csv
from newspaper import Config
from newspaper import Article
from os.path import exists

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://www.cnn.com/2021/10/25/tech/facebook-papers/index.html', 'https://www.cnn.com/entertainment/live-news/rust-shooting-alec-baldwin-10-25-21/h_257c62772a2b69cb37db397592971b58']

for url in urls:
    article = Article(url, config=config)
    article.download()
    article.parse()

    # Pull the publication date out of the page metadata, if it is present.
    article_meta_data = article.meta_data
    published_date = {value for (key, value) in article_meta_data.items() if key == 'pubdate'}
    article_published_date = " ".join(str(x) for x in published_date)

    # Write inside the loop so every article gets its own row.
    file_exists = exists('cnn_extraction_results.csv')
    if not file_exists:
        # First article: create the file and write the header row once.
        with open('cnn_extraction_results.csv', 'w', newline='') as file:
            headers = ['date published', 'article title', 'article text']
            writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
            writer.writeheader()
            writer.writerow({'date published': article_published_date,
                             'article title': article.title,
                             'article text': article.text})
    else:
        # Later articles: append a row without repeating the header.
        with open('cnn_extraction_results.csv', 'a', newline='') as file:
            headers = ['date published', 'article title', 'article text']
            writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
            writer.writerow({'date published': article_published_date,
                             'article title': article.title,
                             'article text': article.text})
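
If the duplicated write logic in the two branches bothers you, one possible variation is to collect the rows first and write them in a single pass. The sketch below is only an illustration, not part of Newspaper3k: the helper name `append_rows` and the `HEADERS` constant are hypothetical, and it assumes each row is a dict keyed by the same three column names used above.

```python
import csv
from os.path import exists, getsize

# Column names reused from the code above.
HEADERS = ['date published', 'article title', 'article text']


def append_rows(path, rows):
    """Append rows (dicts keyed by HEADERS) to path, writing the header only once."""
    # The header is needed only when the file is missing or still empty.
    write_header = not exists(path) or getsize(path) == 0
    # Open the file a single time for the whole batch instead of once per article.
    with open(path, 'a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=HEADERS)
        if write_header:
            writer.writeheader()
        for row in rows:
            writer.writerow(row)
```

In the loop above you would then append a dict built from `article_published_date`, `article.title` and `article.text` to a list for each URL, and call `append_rows('cnn_extraction_results.csv', rows)` once after the loop finishes.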