I am trying to scrape over 600 listings from a real estate website. The name, price, area and valueperm2 fields are mandatory and present on every page, so those were easy to scrape. But other features like the number of rooms, suites, garage spaces and taxes are optional, so the length and order of the elements returned by soup.findAll('h6', class_='mb-0 text-normal') vary from listing to listing.
I tried to create keys and values to store in the data dictionary, but with k2 and v2 I got the out-of-index error, probably because some of the listings have only one optional feature. I thought about using len(soup.findAll('h6', class_='mb-0 text-normal')) to add those optional features conditionally, but I couldn't get it to work. Here is my code:
import requests
from bs4 import BeautifulSoup

productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'
for x in range(1, 40):
    r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('li', class_='property-list__item')
    for item in productlist:
        for link in item.find_all('meta', itemprop='url'):
            productlinks.append(baseurl + link['content'])

for link in productlinks:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find_all('h1', class_='mb-0 font-weight-600 fs-1-5')[0].text.strip()
    price = soup.find_all('small', class_='display-5 text-warning')[2].text.strip()
    area = soup.find_all('small', class_='display-5 text-warning')[0].text.replace("m²", "").strip()
    valueperm2 = soup.find_all('small', class_='display-5 text-warning')[1].text.strip()
    k1 = soup.findAll('h6', class_='mb-0 text-normal')[0].text.replace('\r\n ', '').strip().split(':')[0]
    v1 = soup.findAll('h6', class_='mb-0 text-normal')[0].text.replace('\r\n ', '').strip().split(':')[1].strip()
    k2 = soup.findAll('h6', class_='mb-0 text-normal')[1].text.replace('\r\n ', '').strip().split(':')[0]
    v2 = soup.findAll('h6', class_='mb-0 text-normal')[1].text.replace('\r\n ', '').strip().split(':')[1].strip()
    data = {'name': name,
            'value': value,
            'area': area,
            'valueperm2': valueperm2,
            k1: v1,
            k2: v2
            }
and then I get this output:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-74-6ee7d6edeb81> in <module>
9 v1 = soup.findAll('h6',class_ ='mb-0 text-normal')[0].text.replace('\r\n ','').strip().split(':')[1].strip()
10 k2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n ','').strip().split(':')[0]
---> 11 v2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n ','').strip().split(':')[1].strip()
12
13 ap = {'name':name,
IndexError: list index out of range
CodePudding user response:
I tried to run your code but was not able to reproduce the problem, as I do not have 'baseUrl'.
However, you can check the length of soup.findAll('h6', class_='mb-0 text-normal') before assigning the individual items of the list to the v1, k2, v2 (etc.) variables.
For example,
results = soup.findAll('h6', class_='mb-0 text-normal')
if len(results) >= 2:
    v1 = results[0].text.replace('\r\n ', '').strip().split(':')[1].strip()
    k2 = results[1].text.replace('\r\n ', '').strip().split(':')[0]
    v2 = results[1].text.replace('\r\n ', '').strip().split(':')[1].strip()
You will likely need to reorder or amend this based on the specific logic you are implementing, but code along these lines should work.
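A minimal, self-contained sketch of that length check, using plain strings in place of the h6 tags (the split-on-':' pattern is the same) and defaulting missing fields to None so later code can detect them:

```python
# Stand-ins for the .text of each h6 tag; a real listing may have 0, 1 or 2+
results = ['Quartos: 2']  # this listing has only one optional feature

k1 = v1 = k2 = v2 = None  # defaults, so missing features stay as None

if len(results) >= 1:
    k1, v1 = (part.strip() for part in results[0].split(':', 1))
if len(results) >= 2:
    k2, v2 = (part.strip() for part in results[1].split(':', 1))

print(k1, v1)  # Quartos 2
print(k2, v2)  # None None
```

Because each index is guarded by its own length check, a listing with a single optional feature no longer raises IndexError.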
CodePudding user response:
This error happens for the following reason:
- You want to extract text by splitting on ':'
- And expect the length of the split data to be 2 (indices 0 & 1)
- 'name:roy' -> ['name', 'roy'] will work fine
- 'name' -> ['name'] has no index 1, causing the IndexError
A separate function for extracting the dynamic fields from the page is a better option, avoiding repetitive code and unwanted errors:
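The two cases above can be checked directly in Python:

```python
ok = 'name:roy'.split(':')
print(ok)     # ['name', 'roy']
print(ok[1])  # 'roy' -- index 1 exists, so this is fine

bad = 'name'.split(':')
print(bad)    # ['name'] -- only index 0 exists
try:
    bad[1]
except IndexError as e:
    print('IndexError:', e)  # IndexError: list index out of range
```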
def dynamic_portion(soup):
    temp_data = {}
    for item in soup.findAll('h6', class_='mb-0 text-normal'):
        item = item.text.split(':')
        if len(item) == 2:
            key, val = map(str.strip, item)
            temp_data[key] = val
    return temp_data
You can integrate it into your code as follows:
import requests
from bs4 import BeautifulSoup

productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'
for x in range(1, 40):
    r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('li', class_='property-list__item')
    for item in productlist:
        for link in item.find_all('meta', itemprop='url'):
            productlinks.append(baseurl + link['content'])

for link in productlinks:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find_all('h1', class_='mb-0 font-weight-600 fs-1-5')[0].text.strip()
    value = 1  # placeholder: 'value' was undefined in the original code
    price = soup.find_all('small', class_='display-5 text-warning')[2].text.strip()
    area = soup.find_all('small', class_='display-5 text-warning')[0].text.replace("m²", "").strip()
    valueperm2 = soup.find_all('small', class_='display-5 text-warning')[1].text.strip()
    data = {'name': name,
            'value': value,
            'area': area,
            'valueperm2': valueperm2
            }
    temp_data = dynamic_portion(soup)
    data.update(temp_data)
    break  # remove this break to process every link
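One caveat with the loop above: data is rebuilt on every iteration and never stored, so once the break is removed only the last listing would survive. A minimal sketch of collecting each listing's dictionary into a list (the sample dicts here are hypothetical stand-ins for the scraped fields):

```python
all_listings = []  # one dict per listing, appended inside the scraping loop

# Stand-ins for the per-listing 'data' dicts; optional keys differ per listing
for data in ({'name': 'Apto A', 'Quartos': '2'},
             {'name': 'Apto B', 'Quartos': '3', 'Suites': '1'}):
    all_listings.append(data)

print(len(all_listings))        # 2
print(all_listings[0]['name'])  # Apto A
```

From a list of dicts like this you can later build a DataFrame or write a CSV, and the varying optional keys simply become columns with missing values.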