Webscraping with beautifulsoup get text from all paragraphs in the div and add it to list

Time:08-10

I have the following problem: I would like to get all paragraphs from a certain div and put them into a list so that all of those paragraphs act as one entry in the list.

The following code produces more than one entry:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from fake_useragent import UserAgent
import random

ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
url = 'https://www.caranddriver.com/audi/a4-2017' 
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
article = html_soup.find('div', attrs={'class': 'review-body-content'}).findAll('p')
article_text = []

for element in article:
  article_text.append('\n' + ''.join(element.findAll(text = True)))

The output of len() on the article_text list is 10, and I would like it to be 1. I would also like my code to add further reviews to the list automatically, so that when I change the year to the next one (e.g. 2018), all paragraphs from that year's review become a second entry in the list. Overall, I want to create a pandas DataFrame that contains the review for each year as a separate row.

CodePudding user response:

Keeping within your code, all you have to do is define article_text as a string, add each paragraph to it, and at the end append it as a single item to the list (if I understand your issue correctly):

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from fake_useragent import UserAgent
import random

articles = []
ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
for i in range(2017, 2019):
    url = f'https://www.caranddriver.com/audi/a4-{i}' 
    response = requests.get(url, headers=header)
    print(response)
    html_soup = BeautifulSoup(response.text, 'lxml')
    article = html_soup.find('div', attrs={'class': 'review-body-content'}).findAll('p')
    article_text = ''

    for element in article:
      article_text = article_text + '\n' + ''.join(element.findAll(text = True))
    articles.append(article_text)
    print(article_text[:50])
    print('_______________')
print('Items in articles list:', len(articles))
print(articles)

This would return:

<Response [200]>

The A4 embodies everything we love about Audi: st.. ## 2017 review
_______________
<Response [200]>

The 2018 Audi A4 is perhaps the most well-rounded .. ## 2018 review
_______________
Items in articles list: 2
["\nThe A4 embodies everything we love about Audi: strong performance, modern design, advanced technology, and sumptuous comfort. ...] ## articles list with 2 items in it
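To get from the list to the DataFrame mentioned in the question (one row per year), something along these lines might work. It is only a sketch: the helper name review_text, the sample_pages dictionary and its HTML snippets are made up for illustration and stand in for response.text from the real requests, and the built-in html.parser is used here instead of lxml so that the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup
import pandas as pd


def review_text(html, div_class="review-body-content"):
    """Return the text of all <p> tags inside the target div as one string."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_=div_class)
    if body is None:
        # The div was not found, e.g. the page layout differs for this year.
        return None
    # One string per review: paragraphs separated by a newline.
    return "\n".join(p.get_text().strip() for p in body.find_all("p"))


# Hypothetical stand-in pages (assumption); in the real script each value
# would be response.text for the corresponding year.
sample_pages = {
    2017: '<div class="review-body-content">'
          "<p>First 2017 paragraph.</p><p>Second 2017 paragraph.</p></div>",
    2018: '<div class="review-body-content"><p>Only 2018 paragraph.</p></div>',
}

# One row per year, with the whole review in a single cell.
reviews = pd.DataFrame(
    [{"year": year, "review": review_text(html)}
     for year, html in sample_pages.items()]
)
print(reviews)
```

Returning None when the div is missing avoids the AttributeError that find(...).findAll('p') would raise if a page for some year has no such div.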