Strip the number from scraped data and store it in a separate variable

Time:10-23

I am using the following code to scrape data, and I want to extract the numeric part (the year) from each scraped string and store it separately.

import requests
import pandas as pd
from bs4 import BeautifulSoup as soup

base_site_1 = "https://www.cartrade.com/buy-used-cars/new-delhi/c#city=10&sc=-1&so=-1&pn=1"

response = requests.get(base_site_1)
response.status_code

html = response.content
html[:100]

html5lib_soup = soup(html, 'html5lib')

with open('cartrade_used_cars_mumbai_html5lib_Parser.html', 'wb') as file:
    file.write(html5lib_soup.prettify('utf-8'))

html5lib_soup

model = []

for i in html5lib_soup.findAll('div', {'class' : 'grid_cnt_new'}):
    model.append(i.a.text.strip())

model

Output is

['2013 Hyundai Verna Fluidic 1.6 VTVT SX',
 '2021 Land Rover Discovery Sport R-Dynamic SE',
 '2011 Hyundai Santro Xing GLS (CNG)',
 '2016 Porsche Cayenne 3.2 V6 Petrol',
 '2010 Hyundai i10 Era',
 '2011 Honda City V MT CNG Compatible',
 '2016 Mercedes-Benz GLE 250 d',
 '2013 Honda Amaze 1.5 S i-DTEC',
 '2013 Maruti Suzuki Estilo LXi BS-IV',
 '2019 Tata Tiago Revotron XZ',
 '2015 Toyota Innova 2.5 G BS IV 7 STR',
 '2009 Honda City 1.5 S MT',
 '2009 Hyundai Santro Xing GLS',
 '2011 Mahindra Scorpio VLX 4WD Airbag BS-IV',
 '2014 Maruti Suzuki Celerio LXi',
 '2014 Hyundai Grand i10 Magna 1.2 Kappa VTVT [2013-2016]',
 '2016 Hyundai Grand i10 Magna 1.2 Kappa VTVT [2017-2020]',
 '2011 Hyundai i10 Sportz 1.2 Kappa2',
 '2016 Ford EcoSport Titanium 1.5L TDCi',
 '2015 Maruti Suzuki Baleno Delta 1.2',
 '2019 Volkswagen Ameo Trendline 1.2L (P)',
 '2010 Honda City 1.5 S MT',
 '2014 Renault Duster 85 PS RxL Diesel',
 '2014 Honda Brio S MT',
 '2017 Maruti Suzuki Vitara Brezza ZDi  Dual Tone [2017-2018]',
 '2020 Maruti Suzuki Wagon R 1.0 VXI  (O)',
 '2016 Maruti Suzuki Ciaz ZXI  AT',
 '2017 Maruti Suzuki Celerio ZXi [2017-2019]',
 '2011 Toyota Corolla Altis 1.8 G',
 '2012 Honda Brio E MT',
 '2013 Toyota Innova 2.5 GX 7 STR BS-III',
 '2018 Maruti Suzuki Swift Lxi (O) [2014-2017]']

I want to strip 2013 from '2013 Hyundai Verna Fluidic 1.6 VTVT SX', do the same for every string, and store the numbers in a separate variable. When needed, I want to display the collected numbers from all the strings like this:

['2013',
 '2021',
 '2011',
 '2016',
 '2010',
 '2011',
 '2016',
 '2013',
 '2013',
 '2019',
 '2015',
 '2009',
 '2009',
 '2011',
 '2014',
 '2014',
 '2016',
 '2011',
 '2016',
 '2015',
 '2019',
 '2010',
 '2014',
 '2014',
 '2017',
 '2020',
 '2016',
 '2017',
 '2011',
 '2012',
 '2013',
 '2018']

CodePudding user response:

You can use a list comprehension to select the first 4 characters of each string and store them in a new list.

# model is the list of scraped titles from the question
years = [car[:4] for car in model]

If you want to display the results per line, you can do

print(*years, sep="\n")
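
Slicing assumes every title starts with exactly four digits. If some entries might not, a regular expression can check for that first. Here is a minimal sketch, assuming the scraped titles are in a list like `model` from the question; the short sample list, the made-up last entry, and the helper name `extract_year` are only for illustration.

```python
import re

# Sample titles copied from the question's output; the last one is a
# made-up entry without a leading year, to show the fallback.
model = [
    '2013 Hyundai Verna Fluidic 1.6 VTVT SX',
    '2021 Land Rover Discovery Sport R-Dynamic SE',
    '2014 Hyundai Grand i10 Magna 1.2 Kappa VTVT [2013-2016]',
    'Hyundai i10 Era',
]

def extract_year(title):
    """Return the leading four-digit year as a string, or None if absent."""
    # Anchored at the start, so the bracketed years later in a title are ignored.
    match = re.match(r'\s*(\d{4})\b', title)
    return match.group(1) if match else None

years = [extract_year(title) for title in model]
print(years)  # ['2013', '2021', '2014', None]
```

Because `re.match` only looks at the start of the string, a range such as [2013-2016] at the end of a title does not affect the result.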

CodePudding user response:

You can split each title on whitespace and take the first token, which is the year:

import requests
import pandas as pd
from bs4 import BeautifulSoup as soup

base_site_1 = "https://www.cartrade.com/buy-used-cars/new-delhi/c#city=10&sc=-1&so=-1&pn=1"

response = requests.get(base_site_1)
response.status_code

html = response.content
html[:100]

html5lib_soup = soup(html, 'html5lib')

with open('cartrade_used_cars_mumbai_html5lib_Parser.html', 'wb') as file:
    file.write(html5lib_soup.prettify('utf-8'))

html5lib_soup

years = []

for i in html5lib_soup.findAll('div', {'class': 'grid_cnt_new'}):
    # The first whitespace-separated token of each title is the year
    year = i.a.get_text(strip=True).split()[0]
    years.append(year)  # keep the years in a separate list
    print(year)

Output:

2011
2015
2013
2017
2010
2012
2016
2012
2013
2019
2015
2009
2009
2011
2014
2014
2016
2011
2016
2015
2019
2010
2014
2014
2017
2020
2016
2017
2011
2012
2013
2018