How to web scrape find_all as text/string - Python web scraping question

Time:02-22

I would like to scrape a website. How can I get only the text of the output, in this format: "BEV, Enyaq Coupé iV vRS, Skoda, UK, Volkswagen"? Currently, my output also includes the HTML tags.

Thanks for your input!

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.electrive.com/2022/02/13/skoda-reveals-uk-pricing-for-enyaq-coupe-iv-vrs/').text
soup = BeautifulSoup(source, 'lxml')

article = soup.find()

tags2 = article.find_all('div', class_='tags')
print(tags2)

Output:

[<div class="tags">
<a href="https://www.electrive.com/tag/bev/" rel="tag">BEV</a><a href="https://www.electrive.com/tag/enyaq-coupe-iv-vrs/" rel="tag">Enyaq Coupé iV vRS</a><a href="https://www.electrive.com/tag/skoda/" rel="tag">Skoda</a><a href="https://www.electrive.com/tag/uk/" rel="tag">UK</a><a href="https://www.electrive.com/tag/volkswagen/" rel="tag">Volkswagen</a> </div>]
[Finished in 580ms]

CodePudding user response:

You have to select your elements more specifically, because the information is in the <a> elements. find_all() returns a ResultSet of whole tags, so iterate over it and take the .text of each element, for example with a list comprehension:

tags2 = [e.text for e in soup.find('div', class_='tags').find_all('a')]

Alternatively, use CSS selectors:

tags2 = [e.text for e in soup.select('div.tags a')]

#output
['BEV', 'Enyaq Coupé iV vRS', 'Skoda', 'UK', 'Volkswagen']

If you would like a string instead of a list, just join() the elements. Use ', ' as the separator to match the format requested in the question:

tags2 = ', '.join(e.text for e in soup.find('div', class_='tags').find_all('a'))

#output
BEV, Enyaq Coupé iV vRS, Skoda, UK, Volkswagen
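
To try the approach above without requesting the live page, here is a minimal sketch that runs against an inline copy of the tag markup from the question. The class="tags" attribute on the <div> is assumed from the selector used in the question, the names html, tag_names and tags_line are only illustrative, and 'html.parser' is used in place of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the output shown in the question.
# Assumption: the <div> carries class="tags", as the selector implies.
html = '''<div class="tags">
<a href="https://www.electrive.com/tag/bev/" rel="tag">BEV</a><a href="https://www.electrive.com/tag/enyaq-coupe-iv-vrs/" rel="tag">Enyaq Coupé iV vRS</a><a href="https://www.electrive.com/tag/skoda/" rel="tag">Skoda</a> </div>'''

soup = BeautifulSoup(html, 'html.parser')

# select() returns a (possibly empty) list, so a missing div yields an
# empty string here instead of raising AttributeError.
tag_names = [a.get_text(strip=True) for a in soup.select('div.tags a')]

# Join with ', ' to get the comma-and-space format asked for.
tags_line = ', '.join(tag_names)
print(tags_line)
```

One difference from the find(...).find_all(...) chain: find() returns None when the <div> is missing, so the chained call would fail, while select() simply returns an empty list.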

CodePudding user response:

Follow-up question: same webpage, but this time I want to scrape the sources listed in the article. Is there any chance to get them in this format: "skodamedia.com (https://skodamedia.com/en-gb/releases/1297), skodamedia.com (https://skodamedia.com/en-gb/releases/1296)"?

Thank you

source2 = [c for c in article.find('section', class_='content').find_all('a')]
print(source2)

Output:

[<a  href="https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min.png" target="_blank">
<img alt=""  height="150" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min-300x150.png" srcset="https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min-300x150.png 300w, https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min-444x222.png 444w, https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min-888x444.png 888w, https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min-768x384.png 768w, https://www.electrive.com/wp-content/uploads/2022/02/skoda-enyaq-coupe-iv-grossbritannien-uk-2022-01-min.png 1500w" width="300"/> </a>, <a href="https://www.electrive.com/2022/01/31/skoda-presents-the-new-enyaq-coupe-iv/">Enyaq Coupé</a>, <a href="https://www.electrive.com/2020/09/01/skoda-enyaq-iv-first-meb-suv-comes-from-the-czech-republic/">September 2020</a>, <a href="https://skodamedia.com/en-gb/releases/1297">skodamedia.com</a>, <a href="https://skodamedia.com/en-gb/releases/1296">skodamedia.com</a>]
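
One possible way to get that format, as a sketch rather than a confirmed answer: keep only the links that point outside the site itself and format each as "text (href)". The rule "a source is any link whose host is not electrive.com" is an assumption based on the output above, not something the page guarantees. The helper name external_sources and its own_domain parameter are hypothetical, and the inline HTML is a shortened copy of the links shown above.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Shortened inline sample based on the links printed above.
html = '''<section class="content">
<a href="https://www.electrive.com/2022/01/31/skoda-presents-the-new-enyaq-coupe-iv/">Enyaq Coupé</a>
<a href="https://skodamedia.com/en-gb/releases/1297">skodamedia.com</a>
<a href="https://skodamedia.com/en-gb/releases/1296">skodamedia.com</a>
</section>'''

soup = BeautifulSoup(html, 'html.parser')


def external_sources(soup, own_domain='electrive.com'):
    """Return 'text (url)' for every link that leaves own_domain.

    Assumption: sources are exactly the external links in the section.
    """
    sources = []
    for a in soup.select('section.content a[href]'):
        host = urlparse(a['href']).netloc
        # Skip relative links (empty host) and links to the site itself,
        # including subdomains such as www.electrive.com.
        if not host or host == own_domain or host.endswith('.' + own_domain):
            continue
        sources.append(f"{a.get_text(strip=True)} ({a['href']})")
    return ', '.join(sources)


sources_line = external_sources(soup)
print(sources_line)
```

If the page marks its sources in a more specific way (for example a dedicated paragraph or class), selecting that element directly would be more reliable than filtering by domain.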