Web scraping with BeautifulSoup: get text from all paragraphs in a div and add it to a list

I have the following problem: I would like to get all paragraphs from a certain div and put them into a list so that all of those paragraphs act as one entry in the list.

The following code produces more than one entry:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from fake_useragent import UserAgent
import random

ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
url = 'https://www.caranddriver.com/audi/a4-2017' 
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
article = html_soup.find('div', attrs={'class': 'review-body-content'}).findAll('p')
article_text = []

for element in article:
  article_text.append('\n' + ''.join(element.findAll(text = True)))

The output of len() on the article_text list is 10, and I would like it to be 1. I would also like my code to add more reviews to the list automatically, so that when I change the year to the next one (e.g. 2018), all paragraphs from that year's review become a second entry. Overall, I want to create a pandas DataFrame that contains the review for each year as a separate row.

CodePudding user response:

Keeping within your code, all you have to do is define article_text as a string, add each paragraph to it, and at the end append it as a single item to the list (if I understand your issue correctly):

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from fake_useragent import UserAgent
import random

articles = []
ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
for i in range(2017, 2019):
    url = f'https://www.caranddriver.com/audi/a4-{i}' 
    response = requests.get(url, headers=header)
    print(response)
    html_soup = BeautifulSoup(response.text, 'lxml')
    article = html_soup.find('div', attrs={'class': 'review-body-content'}).findAll('p')
    article_text = ''

    for element in article:
      article_text = article_text + '\n' + ''.join(element.findAll(text = True))
    articles.append(article_text)
    print(article_text[:50])
    print('_______________')
print('Items in articles list:', len(articles))
print(articles)

This would return:

<Response [200]>

The A4 embodies everything we love about Audi: st.. ## 2017 review
_______________
<Response [200]>

The 2018 Audi A4 is perhaps the most well-rounded .. ## 2018 review
_______________
Items in articles list: 2
["\nThe A4 embodies everything we love about Audi: strong performance, modern design, advanced technology, and sumptuous comfort. ...] ## articles list with 2 items in it
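
Since the stated goal is a pandas DataFrame with one review per row, the same idea can be sketched a little more compactly. This is a minimal sketch, not a drop-in replacement: the `review_text` helper and the `SAMPLE_PAGES` dictionary are hypothetical names introduced here, and the inline HTML only stands in for the real pages so the example runs without a network request. It assumes the review still lives in a div with the class `review-body-content`, as in the code above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in pages keyed by year, so the sketch runs offline.
# In the real scraper these strings would come from response.text.
SAMPLE_PAGES = {
    2017: '<div class="review-body-content">'
          '<p>First 2017 paragraph.</p><p>Second 2017 paragraph.</p></div>',
    2018: '<div class="review-body-content"><p>Only 2018 paragraph.</p></div>',
}


def review_text(html):
    """Return the text of every <p> in the review div as one newline-joined string."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'review-body-content'})
    if body is None:
        # The div is missing (layout change, blocked request, ...): return an empty review.
        return ''
    return '\n'.join(p.get_text() for p in body.find_all('p'))


# One dictionary per year becomes one row in the DataFrame.
rows = [{'year': year, 'review': review_text(html)} for year, html in SAMPLE_PAGES.items()]
reviews = pd.DataFrame(rows)
print(reviews.shape)  # (2, 2): two years, two columns
```

Joining the paragraphs with `'\n'.join(...)` avoids the leading newline that repeated string concatenation produces, and the `None` check keeps the loop from raising an AttributeError if a page for some year lacks the expected div.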