Web Scraping Contents In A List Wrapped Inside A Class With Python

Time:12-02

I am trying to extract all the items from a list on this page.

Code

import bs4, requests
import pandas as pd

wagon_stock_url = 'https://parramattamg.com.au/up4053-961230-mg-hs-2020.html'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}


response = requests.get(wagon_stock_url, headers = headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')  

name = soup.select(".stockItemInfo")

I know soup.select(".stockItemInfo") just selects the elements with that class and returns them as a list, but how do I get each item while iterating?

CodePudding user response:

You're close to a solution. Just add li to your CSS selector, which gives you a result set of all the list elements:

name = soup.select(".stockItemInfo li")

--> [<li>    <span><strong>Vehicle</strong></span>: 2020 MG HS      </li>, <li>    <span><strong>Series</strong></span>: SAS23 MY20     </li>, <li>    <span><strong>Badge</strong></span>: Vibe DCT FWD        </li>, <li>    <span><strong>Colour</strong></span>: White      </li>, <li>    <span><strong>Odometer</strong></span>: 11,213kms        </li>, <li>    <span><strong>Body</strong></span>: Wagon        </li>, <li>    <span><strong>Engine</strong></span>: 1.5 litre, 4-cylinder      </li>, <li>    <span><strong>Fuel Type</strong></span>: Petrol      </li>, <li>    <span><strong>Transmission</strong></span>: 7-speed Automatic        </li>, <li>    <span><strong>Doors</strong></span>: 5-door      </li>, <li>    <span><strong>Seats</strong></span>: 5       </li>, <li>    <span><strong>Trim</strong></span>: Black        </li>, <li>    <span><strong>VIN</strong></span>: LSJA24U92LN012249     </li>, <li>    <span><strong>Registration</strong></span>: EIT61T       </li>, <li>    <span><strong>Stock Number</strong></span>: UP4053       </li>, <li>    <span><strong>MY</strong></span>: 20     </li>]

Or get just the names as a list:

names = [x.text for x in soup.select(".stockItemInfo li strong")]

--> ['Vehicle', 'Series', 'Badge', 'Colour', 'Odometer', 'Body', 'Engine', 'Fuel Type', 'Transmission', 'Doors', 'Seats', 'Trim', 'VIN', 'Registration', 'Stock Number', 'MY']
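To address the iteration part of the question directly, here is a minimal sketch of looping over the selected li elements and reading each one's text. The inline HTML is a hypothetical, trimmed-down stand-in for the real page (an assumption, so the snippet runs without a network request); with the live page you would keep the soup built from response.text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page's markup (assumption: same structure)
html = """
<div class="stockItemInfo">
  <ul>
    <li>    <span><strong>Vehicle</strong></span>: 2020 MG HS      </li>
    <li>    <span><strong>Colour</strong></span>: White      </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for li in soup.select(".stockItemInfo li"):
    # strip=True trims every text fragment inside the <li> before joining them
    rows.append(li.get_text(strip=True))

print(rows)
```

Each element in the result set is a Tag, so the usual methods such as get_text() or select() are available on it inside the loop.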

To get a list of dicts with names and values, in case you would like to post-process the data or pass it to pd.DataFrame(data):

data = []
for x in soup.select(".stockItemInfo li"):
    # Split on the first colon only, so a value containing ':' stays intact
    item = x.text.strip().split(':', 1)
    data.append({
        'name': item[0],
        'value': item[1]
    })

data

Output

 [{'name': 'Vehicle', 'value': ' 2020 MG HS'},
 {'name': 'Series', 'value': ' SAS23 MY20'},
 {'name': 'Badge', 'value': ' Vibe DCT FWD'},
 {'name': 'Colour', 'value': ' White'},
 {'name': 'Odometer', 'value': ' 11,213kms'},
 {'name': 'Body', 'value': ' Wagon'},
 {'name': 'Engine', 'value': ' 1.5 litre, 4-cylinder'},
 {'name': 'Fuel Type', 'value': ' Petrol'},
 {'name': 'Transmission', 'value': ' 7-speed Automatic'},
 {'name': 'Doors', 'value': ' 5-door'},
 {'name': 'Seats', 'value': ' 5'},
 {'name': 'Trim', 'value': ' Black'},
 {'name': 'VIN', 'value': ' LSJA24U92LN012249'},
 {'name': 'Registration', 'value': ' EIT61T'},
 {'name': 'Stock Number', 'value': ' UP4053'},
 {'name': 'MY', 'value': ' 20'}]
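The values in this output keep the leading space left over from splitting on ':'. A minimal sketch of one way to trim it, assuming every item has the same "name: value" text shape, is to use str.partition and strip both sides. The helper name parse_pair and the sample strings are hypothetical, introduced only for illustration.

```python
def parse_pair(text):
    """Split 'name: value' on the first colon and trim both sides."""
    name, _, value = text.partition(":")
    return name.strip(), value.strip()

# Sample strings in the same shape as the <li> text (illustrative data)
lines = ["Vehicle: 2020 MG HS", "Odometer: 11,213kms", "MY: 20"]

# A plain dict makes single lookups such as specs["Vehicle"] convenient
specs = dict(parse_pair(line) for line in lines)
print(specs["Vehicle"])
```

Because partition splits on the first colon only, a value that itself contains a colon is kept whole.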

CodePudding user response:

The minimal working solution, so far:

Code

import bs4, requests
import pandas as pd

wagon_stock_url = 'https://parramattamg.com.au/up4053-961230-mg-hs-2020.html'

headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}


response = requests.get(wagon_stock_url, headers = headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')  

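data = []
# Direct <li> children of the <ul> inside the .stockItemInfo container
items = soup.select(".stockItemInfo > ul > li")
for item in items:
    # Split on the first colon only, so a value containing ':' stays intact
    name, value = item.get_text(strip=True).split(':', 1)
    data.append([name, value])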

cols=["Name","Value"]
df = pd.DataFrame(data,columns=cols)
print(df)
#df.to_csv('info.csv',index=False)  #to store data in your system

Output:

            Name                   Value
0        Vehicle              2020 MG HS
1         Series              SAS23 MY20
2          Badge            Vibe DCT FWD
3         Colour                   White
4       Odometer               11,213kms
5           Body                   Wagon
6         Engine   1.5 litre, 4-cylinder
7      Fuel Type                  Petrol
8   Transmission       7-speed Automatic
9          Doors                  5-door
10         Seats                       5
11          Trim                   Black
12           VIN       LSJA24U92LN012249
13  Registration                  EIT61T
14  Stock Number                  UP4053
15            MY                      20
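If pandas is only needed for the commented-out df.to_csv step, the standard library's csv module may be enough. The following is a rough sketch under that assumption. The rows are illustrative sample data, and io.StringIO stands in for a real file so the snippet has no side effects.

```python
import csv
import io

# Illustrative rows in the same [name, value] shape as the data list
rows = [["Vehicle", "2020 MG HS"], ["Colour", "White"]]

# Stand-in for: open('info.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Value"])  # header row
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with a file opened in a with block, passing newline='' as the csv documentation recommends.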