I have the following problem: I would like to get all paragraphs from a certain div and put them into a list so that all of those paragraphs act as a single entry in the list.
The following code produces more than one entry:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from fake_useragent import UserAgent
import random
ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
url = 'https://www.caranddriver.com/audi/a4-2017'
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
article = html_soup.find('div', attrs={'class': 'review-body-content'}).findAll('p')
article_text = []
for element in article:
    article_text.append('\n' + ''.join(element.findAll(text=True)))
The output of len() on the article_text list is 10, and I would like it to be 1. I would also like my code to add more reviews to the list automatically, so that when I change the year to the next one (e.g. 2018), all paragraphs from that year's review become a second entry in the list. Overall, I want to create a pandas DataFrame that contains the review for each year as a separate row.
CodePudding user response:
Keeping within your code, all you have to do is define article_text as a string, add each paragraph to it, and at the end append it to the list as a single item (if I understand your issue correctly):
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from fake_useragent import UserAgent
import random
articles = []
ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
for i in range(2017, 2019):
    url = f'https://www.caranddriver.com/audi/a4-{i}'
    response = requests.get(url, headers=header)
    print(response)
    html_soup = BeautifulSoup(response.text, 'lxml')
    article = html_soup.find('div', attrs={'class': 'review-body-content'}).findAll('p')
    # Collect every paragraph of this year's review in one string
    article_text = ''
    for element in article:
        article_text = article_text + '\n' + ''.join(element.findAll(text=True))
    # One list entry per review, not one per paragraph
    articles.append(article_text)
    print(article_text[:50])
    print('_______________')
print('Items in articles list:', len(articles))
print(articles)
This would return:
<Response [200]>
The A4 embodies everything we love about Audi: st.. ## 2017 review
_______________
<Response [200]>
The 2018 Audi A4 is perhaps the most well-rounded .. ## 2018 review
_______________
Items in articles list: 2
["\nThe A4 embodies everything we love about Audi: strong performance, modern design, advanced technology, and sumptuous comfort. ...] ## articles list with 2 items in it
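To get from the list to the DataFrame mentioned in the question (one row per year), something along these lines might work. This is only a sketch: the `review_text` helper, the `SAMPLE_PAGES` dictionary, and the `year`/`review` column names are made up for illustration, and the inline HTML stands in for `response.text` so the example runs without hitting the site. It uses `'html.parser'` instead of `'lxml'` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in pages; in the real script each value would be
# response.text for the corresponding year's URL.
SAMPLE_PAGES = {
    2017: '<div class="review-body-content">'
          '<p>First <b>2017</b> paragraph.</p><p>Second 2017 paragraph.</p></div>',
    2018: '<div class="review-body-content"><p>Only 2018 paragraph.</p></div>',
}


def review_text(html):
    """Return all <p> text inside the review div as one string, or None if the div is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'review-body-content'})
    if body is None:
        return None
    # get_text() flattens nested tags, so each paragraph becomes one piece of text
    return '\n'.join(p.get_text() for p in body.find_all('p'))


# One dictionary per year becomes one row in the DataFrame
rows = [{'year': year, 'review': review_text(html)} for year, html in SAMPLE_PAGES.items()]
reviews = pd.DataFrame(rows)
print(reviews.shape)
```

Joining with `'\n'.join(...)` instead of repeated `+` also avoids the leading newline at the start of each review, and the `None` check keeps the loop from crashing if a page for some year has no review div.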