I want to web-scrape a website with Python / BeautifulSoup, including this article: https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/
At the end of each article you always find the sources; for the link above they are listed under "Source" at the bottom of the article.
In some articles on this website only one source is given, but sometimes there are two or three different ones, so the code needs to handle that.
Ideally I want the following output format, "text (href)":
xchuxing.com (https://xchuxing.com/article/45850)
cnevpost.com (https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/)
Here is my first code:
from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
article = soup.find('article')
source = [c for c in article.find('section', class_='content').find_all('a')]
for link in source[3:]:
    link.get('href')  # note: this value is never used
    print(link)
Output as of now:
<a href="https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/" rel="noopener" target="_blank">cnevpost.com</a>
Thank you!
CodePudding user response:
Getting all links using CSS selectors:
from bs4 import BeautifulSoup
import requests

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
for link in soup.select('.content a'):
    print(link.get('href'))
Output:
https://www.electrive.com/2021/04/20/byd-presents-800-volt-platform/
https://www.electrive.com/2021/09/13/byd-presents-electric-sedan-concept-ocean/
https://www.electrive.com/2020/03/30/byd-reveals-blade-battery-focussed-on-safety/
https://www.electrive.com/2021/03/16/byd-reveals-additional-blade-battery-specifications/
https://xchuxing.com/article/45850
https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/
http://electrive.com/
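The selector above returns every link in the article body, not just the sources. If you only want the external sources in the asker's "text (href)" format, one option (a sketch, assuming the source links are exactly the ones pointing away from electrive.com) is to filter by hostname:

```python
from urllib.parse import urlparse

# Hypothetical (text, href) pairs standing in for the scraped <a> tags above
links = [
    ("800-volt platform", "https://www.electrive.com/2021/04/20/byd-presents-800-volt-platform/"),
    ("xchuxing.com", "https://xchuxing.com/article/45850"),
    ("cnevpost.com", "https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/"),
]

def external_sources(links, site="electrive.com"):
    """Keep only links whose host is not the site itself, formatted as 'text (href)'."""
    out = []
    for text, href in links:
        host = urlparse(href).netloc
        if site not in host:
            out.append(f"{text} ({href})")
    return out

for line in external_sources(links):
    print(line)
```

This sidesteps having to know how many internal links precede the sources, at the cost of assuming the sources are always off-site.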
CodePudding user response:
I think the sources are always in the last paragraph of the article, so you can extract them like this:
from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
paragraphs = soup.find('section', class_='content').find_all('p')

# the sources are in the last paragraph
sources = paragraphs[-1].find_all('a')

# collect each source's name and link as a (text, href) tuple
sources_links = []
for source in sources:
    sources_links.append((source.text, source['href']))

for l in sources_links:
    print(l)

# write to csv (newline='' avoids blank rows on Windows)
with open('electrive_scrape_source.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Source', 'Link'])
    csv_writer.writerows(sources_links)
This also saves the data to a CSV file.
CodePudding user response:
Follow-up question: I want to write the result to a CSV file, but with this code it only writes the first source, not the second.
from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')

csv_header = ['Source']
csv_file = open('electrive_scrape_source.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(csv_header)

paragraphs = soup.find('section', class_='content').find_all('p')
sources = paragraphs[-1].find_all('a')
sources_links = {}
for source in sources:
    sources_links[source.text] = source['href']

for l in sources_links:
    source_output = (l + ' ' + '(' + sources_links[l] + ')')
    csv_data = [source_output]
    csv_writer.writerow(csv_data)

csv_file.close()
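One way to make sure every source lands in its own row is to build all the rows first and hand them to `csv.writer.writerows` in one call, opening the file with `newline=''` so the csv module controls line endings. A minimal sketch, assuming `sources_links` is the text-to-href dict built above (hard-coded here for illustration):

```python
import csv

# Hypothetical scraped data standing in for sources_links above
sources_links = {
    "xchuxing.com": "https://xchuxing.com/article/45850",
    "cnevpost.com": "https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/",
}

# one single-column row per source, formatted as 'text (href)'
rows = [[f"{text} ({href})"] for text, href in sources_links.items()]

with open('electrive_scrape_source.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Source'])  # header
    writer.writerows(rows)       # all data rows at once
```

The `with` block also guarantees the file is flushed and closed even if an error occurs, which a bare `open()`/`close()` pair does not.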