Web Scraping & BeautifulSoup <li> parsing

I'm just learning web scraping and want to save the listings from https://www.avbuyer.com/aircraft/private-jets to a CSV file.

However, I'm struggling with the year, S/N and total time fields in the code below. They work when I use "soup" in place of "post", but not when I try to collect them together with the other fields. Any help would be much appreciated.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.avbuyer.com/aircraft/private-jets'

page = requests.get(url)
page

soup = BeautifulSoup(page.text, 'lxml')
soup

df = pd.DataFrame({'Plane':[''], 'Year':[''], 'S/N':[''], 'Total Time':[''], 'Price':[''], 'Location':[''], 'Description':[''], 'Tag':[''], 'Last updated':[''], 'Link':['']})

while True:
    
    postings = soup.find_all('div', class_ = 'listing-item premium')
    for post in postings:
        try:
            link = post.find('a', class_ = 'more-info').get('href')
            link_full = 'https://www.avbuyer.com' + link
            plane = post.find('h2', class_ = 'item-title').text
            price = post.find('div', class_ = 'price').text
            location = post.find('div', class_ = 'list-item-location').text
            year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[2]
            year.find_all('li')[0].text
            sn = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            sn.find('li')[1].text
            time = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            time.find('li')[2].text
            desc = post.find('div', classs_ = 'list-item-para').text
            tag = post.find('div', class_ = 'list-viewing-date').text
            updated = post.find('div', class_ = 'list-update').text
            df = df.append({'Plane':plane, 'Year':year, 'S/N':sn, 'Total Time':time, 'Price':price, 'Location':location,
                            'Description':desc, 'Tag':tag, 'Last updated':updated, 'Link':link_full}, ignore_index = True)
       
        
        except:
            pass
        
                          
        
    next_page = soup.find('a', {'rel':'next'}).get('href')
    next_page_full = 'https://www.avbuyer.com' + next_page
    next_page_full

    url = next_page_full
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')  

df.to_csv('/Users/xxx/avbuyer.csv')

CodePudding user response：

Try this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a browser-like User-Agent; some sites reject the default one.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_='listing-item premium')

temp = []
for post in postings:
    link = post.find('a', class_='more-info').get('href')
    link_full = 'https://www.avbuyer.com' + link
    plane = post.find('h2', class_='item-title').text
    price = post.find('div', class_='price').text
    location = post.find('div', class_='list-item-location').text
    # Year, S/N and total time are the <li> items inside 'list-other-dtl'.
    details = post.find_all('div', class_='list-other-dtl')
    for detail in details:
        data = [li.text for li in detail.find_all('li')]
        years = data[0]
        s = data[1]
        total_time = data[2]

        temp.append([plane, price, location, link_full, years, s, total_time])

df = pd.DataFrame(temp, columns=["plane", "price", "location", "link", "Years", "S/N", "Totaltime"])
print(df)

output:

                         plane                  price                                           location                                               link      Years           S/N           Totaltime
0     Dassault Falcon 2000LXS              Make offer  North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2021       S/N 377       Total Time 33
1        Cirrus Vision SF50 G1           Please call   North America   Canada, United States - WI, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2018      S/N 0080      Total Time 615
2               Gulfstream IV              Make offer  North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 1990      S/N 1148     Total Time 6425
4                Boeing 787-8              Make offer      Europe, Monaco, For Sale by Global Jet Monaco  https://www.avbuyer.com/aircraft/private-jets/...  Year 2010         S/N -        Total Time 1
5                 Hawker 4000              Make offer      South America, Puerto Rico, For Sale by JetHQ  https://www.avbuyer.com/aircraft/private-jets/...  Year 2009     S/N RC-24     Total Time 2120
6          Embraer Legacy 500              Make offer  North America   Canada, United States - NE, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2015  S/N 55000016     Total Time 2607
7     Dassault Falcon 2000LXS              Make offer  North America   Canada, United States - DE, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2015       S/N 300     Total Time 1909
8        Dassault Falcon 50EX            Please call   North America   Canada, United States - TX, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2002       S/N 320   Total Time 7091.9
9        Dassault Falcon 2000              Make offer  North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2001       S/N 146     Total Time 6760
10      Bombardier Learjet 75              Make offer          Europe, Switzerland, For Sale by Jetcraft  https://www.avbuyer.com/aircraft/private-jets/...  Year 2014    S/N 45-491     Total Time 1611
11                Hawker 800B            Please call   Europe, United Kingdom - England, For Sale by ...  https://www.avbuyer.com/aircraft/private-jets/...  Year 1985    S/N 258037     Total Time 9621
13             BAe Avro RJ100            Please call   North America   Canada, United States - MT, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 1996     S/N E3282    Total Time 45996
14         Embraer Legacy 600              Make offer  North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2007  S/N 14501014     Total Time 4328
15  Bombardier Challenger 850              Make offer  North America   Canada, United States - AZ, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2003      S/N 7755  Total Time 12114.1
16            Gulfstream G650            Please call           Europe, Switzerland, For Sale by Jetcraft  https://www.avbuyer.com/aircraft/private-jets/...  Year 2013      S/N 6047     Total Time 2178
17      Bombardier Learjet 55     Price: USD $995,000  North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 1982       S/N 020    Total Time 13448
18         Dassault Falcon 8X            Please call   North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2016       S/N 406     Total Time 1627
19               Hawker 800XP   Price: USD $1,595,000  North America   Canada, United States - MD, Fo...  https://www.avbuyer.com/aircraft/private-jets/...  Year 2002    S/N 258578    Total Time 10169
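
As the output shows, the last three columns still carry their labels ("Year", "S/N", "Total Time"). If you want only the values in the CSV, a small helper can strip them. This is just a sketch: strip_label is a name I made up, and the sample row is copied from the output above.

```python
def strip_label(text, label):
    """Remove a leading label such as 'Year' from a scraped <li> string."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()

# Sample values copied from the first row of the output above.
row = {"Years": "Year 2021", "S/N": "S/N 377", "Totaltime": "Total Time 33"}
labels = {"Years": "Year", "S/N": "S/N", "Totaltime": "Total Time"}

clean = {col: strip_label(value, labels[col]) for col, value in row.items()}
print(clean)  # {'Years': '2021', 'S/N': '377', 'Totaltime': '33'}
```

The values stay strings on purpose, since serial numbers such as "RC-24" or "-" are not numeric.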

CodePudding user response：

Right now, your bare try/except is swallowing every exception, so you cannot see or debug the errors in your script. If you remove it, you will see the following errors, one after another:

IndexError: list index out of range in line 24 (the year = ... assignment). find_all() returns only two elements here, and index 2 asks for a third one. Therefore, your line should be:

year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[1]

KeyError: 2 in line 26 (the sn = ... assignment). You are using find(), which returns a single <class 'bs4.element.Tag'> object, not a list, so [2] is treated as an attribute lookup on that tag. Here you want to use find_all() as you did in line 24. The same happens in line 28 (the time = ... assignment), and the follow-up calls sn.find('li')[1] and time.find('li')[2] need find_all() as well. Note also that year.find_all('li')[0].text is computed but never assigned, so year still holds the whole <ul> tag.

However, instead of repeating this expression three times, you should store the result in a variable and reuse it.

AttributeError: 'NoneType' object has no attribute 'text' in line 31 (the desc = ... assignment). There is a typo: you wrote classs_ instead of class_.

AttributeError: 'NoneType' object has no attribute 'text' in line 32 (the tag = ... assignment). There is nothing wrong with your code here. Instead, some entries on the web page don't have this element. You should check whether find() returned a result before reading .text:
```
tag = post.find('div', class_ = 'list-viewing-date')
if tag:
  tag = tag.text
else:
  tag = None
```
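
Putting the first four points together, a rough sketch could look like the one below. The HTML is a simplified stand-in that I wrote for illustration (the real page is more complex and may change), and text_or_none and parse_listing are names I made up, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="listing-item premium">
  <h2 class="item-title">Dassault Falcon 2000LXS</h2>
  <ul class="fa-no-bullet clearfix"><li>Jet</li></ul>
  <ul class="fa-no-bullet clearfix">
    <li>Year 2021</li><li>S/N 377</li><li>Total Time 33</li>
  </ul>
  <div class="list-item-para">Low-time aircraft</div>
</div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if find() found nothing."""
    return tag.get_text(strip=True) if tag else None

def parse_listing(post):
    # Look the <ul> elements up once; the details list is the second one (index 1).
    lists = post.find_all('ul', class_='fa-no-bullet clearfix')
    items = lists[1].find_all('li') if len(lists) > 1 else []
    fields = [li.get_text(strip=True) for li in items]
    fields += [None] * (3 - len(fields))  # pad so the unpacking below never fails
    year, sn, total_time = fields[:3]
    return {
        'Plane': text_or_none(post.find('h2', class_='item-title')),
        'Year': year,
        'S/N': sn,
        'Total Time': total_time,
        'Description': text_or_none(post.find('div', class_='list-item-para')),
        'Tag': text_or_none(post.find('div', class_='list-viewing-date')),
    }

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_listing(p) for p in soup.find_all('div', class_='listing-item premium')]
print(rows)
```

Collecting plain dicts like this and calling pd.DataFrame(rows) once at the end also avoids growing the DataFrame row by row inside the loop.
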
Finally, your while loop has no way out. You should detect when the script cannot find a new next_page link and break out of the loop at that point.
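
One possible shape for such a loop is sketched below. To keep it runnable without network access, the download step is passed in as a fetch callable and fed with two fake pages; crawl, fetch, max_pages and the page-2 URL are all assumptions of mine. In the real script you would pass something like lambda u: requests.get(u).text.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.avbuyer.com'

def crawl(start_url, fetch, max_pages=50):
    """Follow rel="next" links until none is left; return the visited URLs.

    fetch is a hypothetical callable that takes a URL and returns HTML text.
    """
    visited = []
    url = start_url
    while url and len(visited) < max_pages:  # max_pages is a safety cap
        soup = BeautifulSoup(fetch(url), 'html.parser')
        visited.append(url)
        # ... parse the listings of this page here ...
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None or not next_link.get('href'):
            break  # no next page: leave the loop
        url = urljoin(BASE_URL, next_link['href'])
    return visited

# Stand-in pages (made up) so the sketch runs offline.
FAKE_PAGES = {
    'https://www.avbuyer.com/aircraft/private-jets':
        '<a rel="next" href="/aircraft/private-jets/page-2">Next</a>',
    'https://www.avbuyer.com/aircraft/private-jets/page-2':
        '<p>no next link on the last page</p>',
}

pages = crawl('https://www.avbuyer.com/aircraft/private-jets', FAKE_PAGES.__getitem__)
print(pages)
```

Checking the result of find() before calling .get('href') is the same None check as for tag above, just applied to the pagination link.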

After making all these changes, I was able to scrape the first page. I used:

Python 3.9.7
bs4 4.10.0

It is very important to state which versions of Python and of the libraries you are using.
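
If you are not sure how to look them up, something along these lines should print them (Python 3.8 or newer). The package list is my guess at what your script needs; note that the bs4 module is distributed under the name beautifulsoup4.

```python
import sys
from importlib.metadata import version, PackageNotFoundError

def describe_environment(packages=("beautifulsoup4", "requests", "pandas", "lxml")):
    """Return one line for Python and one per package, with its installed version."""
    lines = ["Python " + sys.version.split()[0]]
    for name in packages:
        try:
            lines.append(f"{name} {version(name)}")
        except PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    return lines

print("\n".join(describe_environment()))
```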

Cheers!