I want to web-scrape a website with Python / BeautifulSoup, including this article: https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/
At the end of each article you always find the sources; for the link above they are listed under "Source" at the bottom of the article.
In some articles on this website only one source is given, but sometimes there are two or three different ones, so the code needs to handle that.
Ideally I want the following output format, "text (href)":
xchuxing.com (https://xchuxing.com/article/45850)
cnevpost.com (https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/)
Here is my first code:
from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
article = soup.find('article')
source = [c for c in article.find('section', class_='content').find_all('a')]
for link in source[3:]:
    link.get('href')  # note: this value is never used
    print(link)
Output as of now:
<a href="https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/" rel="noopener" target="_blank">cnevpost.com</a>
Thank you!
CodePudding user response:
Getting all links using CSS selectors:
from bs4 import BeautifulSoup
import requests

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
for link in soup.select('.content a'):
    print(link.get('href'))
Output:
https://www.electrive.com/2021/04/20/byd-presents-800-volt-platform/
https://www.electrive.com/2021/09/13/byd-presents-electric-sedan-concept-ocean/
https://www.electrive.com/2020/03/30/byd-reveals-blade-battery-focussed-on-safety/
https://www.electrive.com/2021/03/16/byd-reveals-additional-blade-battery-specifications/
https://xchuxing.com/article/45850
https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/
http://electrive.com/
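The selector above returns every link in the article body, not just the sources. If you only want the external sources in the asker's "text (href)" format, one option (a sketch, assuming the source links are exactly the ones pointing away from electrive.com) is to filter by hostname:

```python
from urllib.parse import urlparse

# Hypothetical (text, href) pairs standing in for the scraped <a> tags above
links = [
    ("800-volt platform", "https://www.electrive.com/2021/04/20/byd-presents-800-volt-platform/"),
    ("xchuxing.com", "https://xchuxing.com/article/45850"),
    ("cnevpost.com", "https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/"),
]

def external_sources(links, site="electrive.com"):
    """Keep only links whose host is not the site itself, formatted as 'text (href)'."""
    out = []
    for text, href in links:
        host = urlparse(href).netloc
        if site not in host:
            out.append(f"{text} ({href})")
    return out

for line in external_sources(links):
    print(line)
```

This sidesteps having to know how many internal links precede the sources, at the cost of assuming the sources are always off-site.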
CodePudding user response:
I think the sources are always in the last paragraph of the article, so you can extract them like this:
from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
paragraphs = soup.find('section', class_='content').find_all('p')

# the sources are in the last paragraph
sources = paragraphs[-1].find_all('a')

# collect each source's name and link as a (text, href) tuple
sources_links = []
for source in sources:
    sources_links.append((source.text, source['href']))

for l in sources_links:
    print(l)

# write to csv (newline='' avoids blank rows on Windows)
with open('electrive_scrape_source.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Source', 'Link'])
    csv_writer.writerows(sources_links)
This also saves the data to a CSV file.
CodePudding user response:
Follow-up question: I want to write the result to a CSV file, but with this code it only writes the first source, not the second.
from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')

csv_header = ['Source']
csv_file = open('electrive_scrape_source.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(csv_header)

paragraphs = soup.find('section', class_='content').find_all('p')
sources = paragraphs[-1].find_all('a')
sources_links = {}
for source in sources:
    sources_links[source.text] = source['href']

for l in sources_links:
    source_output = (l + ' ' + '(' + sources_links[l] + ')')
    csv_data = [source_output]
    csv_writer.writerow(csv_data)

csv_file.close()
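One way to make sure every source lands in its own row is to build all the rows first and hand them to `csv.writer.writerows` in one call, opening the file with `newline=''` so the csv module controls line endings. A minimal sketch, assuming `sources_links` is the text-to-href dict built above (hard-coded here for illustration):

```python
import csv

# Hypothetical scraped data standing in for sources_links above
sources_links = {
    "xchuxing.com": "https://xchuxing.com/article/45850",
    "cnevpost.com": "https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/",
}

# one single-column row per source, formatted as 'text (href)'
rows = [[f"{text} ({href})"] for text, href in sources_links.items()]

with open('electrive_scrape_source.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Source'])  # header
    writer.writerows(rows)       # all data rows at once
```

The `with` block also guarantees the file is flushed and closed even if an error occurs, which a bare `open()`/`close()` pair does not.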