Web scraping: getting the content of the first and second div separately when both sit under the same class

Time:04-12

I have HTML like this inside the content_list variable:

<h3 >Problems with battery capacity long-term</h3>
<div >
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
<div ><strong>Owns this car</strong></div>
</div>
<div >
<p >We have owned our Leaf since May 2011. We have loved the car but are now getting quite concerned. My husband drives the car, on average, 20-40 miles/day to and from work and running errands, mostly 100% on city roads. We live in San Diego, so no issue with winter weather and we live 7 miles from the ocean so seldom have daytime temperatures above 85. Originally, we would get 65-70 miles per 80-90% charge. Last fall we noticed that there was considerably less remaining charge left after a day of driving. He began to track daily miles, remaining "bars", as well as started charging it 100%. For 9 months we have only been getting 40-45 miles on a full charge with only 1-2 "bars" remaining at the end of the day. Sometimes it will be blinking and "talking" to us to get to a charging place ASAP. We just had it into the dealership. Though on a full charge, the car gauge shows 12 bars, the dealership states that the batteries have lost 2 bars via the computer diagnostics (which we are told is a different reading from the car gauge itself) and, that they say, is average and excepted for the car at this age. Everything else (software, diagnostics, etc.) shows 100%, so the dealership thinks that the car is functioning as it should. They are unable to explain why we can only go 40-45 miles on a charge, but keep saying that the car tests out fine. If the distance one is able to drive on a full charge decreases any further, it will begin to render the car useless. As someone else recommended, in retrospect, the best way to go is to lease the Leaf so that battery life is not an issue.</p>
</div>

First I used this code to get the collection of reviews:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'User-Agent': str(ua.safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
# One element per review
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})

Now I would like to extract the review date and the reviewer's name, which in this case would be:

<div >
    <div>July 21, 2014</div>
    <div>By Cathie from San Diego</div>

The problem is that I can't separate those two divs.

My code:

data = []

for e in content_list:
    data.append({
      'review_date':e.find_all("div", {"class":"review-byline"})[0].text,
      'overall_rating': e.select_one('span.sds-rating__count').text,
      'review_title': e.h3.text,
      'review_content': e.select_one('p').text,
      
    })

The result of my code

{'overall_rating': '4.7',
 'review_content': 'This is the perfect electric car for driving around town, doing errands or even for a short daily commuter. It is very comfy and very quick. The only issue was the first gen battery. The 2011-2014 battery degraded quickly and if the owner did not have Nissan replace it, all those cars are now junk and can only go 20 miles or so on a charge. We had Nissan replace our battery with the 2nd gen battery and it is good as new!',
 'review_date': '\nFebruary 24, 2020\nBy EVs are the future from Tucson, AZ\nOwns this car\n',
 'review_title': 'Great Electric Car!'}

CodePudding user response:

For the first one, you could select the first nested <div> directly:

  'review_date':e.find("div", {"class":"review-byline"}).div.text,

For the second one, use e.g. a CSS selector:

  'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
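To see both selectors without hitting the live site, here is a minimal offline sketch. The HTML string is an assumption: it mirrors the snippet from the question, with the class names consumer-review-container and review-byline put back as the question's code implies. It uses the built-in html.parser instead of lxml to keep dependencies small.

```python
from bs4 import BeautifulSoup

# Hand-written sample, modelled on the markup in the question (class names assumed)
html = """
<div class="consumer-review-container">
  <h3>Problems with battery capacity long-term</h3>
  <div class="review-byline">
    <div>July 21, 2014</div>
    <div>By Cathie from San Diego</div>
    <div><strong>Owns this car</strong></div>
  </div>
</div>
"""

e = BeautifulSoup(html, "html.parser")

# .div returns the first <div> nested inside the byline container
review_date = e.find("div", {"class": "review-byline"}).div.text

# :nth-of-type(2) matches the second <div> among its siblings
reviewer_name = e.select_one("div.review-byline div:nth-of-type(2)").text

# Alternative: collect every non-empty text fragment of the byline in order
byline_parts = list(e.find("div", {"class": "review-byline"}).stripped_strings)

print(review_date)
print(reviewer_name)
print(byline_parts)
```

The stripped_strings variant may be handy if you want all byline fields at once and prefer indexing into a list over writing one selector per field.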

Example

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Same request setup as in the question
header = {'User-Agent': str(UserAgent().safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})

data = []

for e in content_list:
    data.append({
        # First <div> inside the byline holds the date
        'review_date': e.find("div", {"class": "review-byline"}).div.text,
        # Second <div> inside the byline holds the reviewer
        'reviewer_name': e.select_one("div.review-byline div:nth-of-type(2)").text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
data
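If you already have the combined byline text from the original code, a possible alternative is to split it on line breaks afterwards. The sketch below is not part of the answer above. It assumes the text always has the date on the first non-empty line and the "By ..." line second, as in the output shown in the question, and that month names are parsed under an English locale.

```python
from datetime import datetime

# Combined byline text, copied from the result in the question
raw = "\nFebruary 24, 2020\nBy EVs are the future from Tucson, AZ\nOwns this car\n"

# Keep only the non-empty lines, without surrounding whitespace
parts = [line.strip() for line in raw.splitlines() if line.strip()]
review_date_text, reviewer_line = parts[0], parts[1]

# Turn the date text into a date object (format assumed: "Month day, year")
review_date = datetime.strptime(review_date_text, "%B %d, %Y").date()

# Drop the leading "By " if it is present
reviewer_name = reviewer_line[3:] if reviewer_line.startswith("By ") else reviewer_line

print(review_date.isoformat())
print(reviewer_name)
```

Selecting the divs directly, as in the example above, is usually more robust, since it does not depend on how whitespace ends up in the extracted text.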