Web Scraping: how can I add rows in a for loop to dataframe?

Time:03-23

I would like to scrape a table from the URLs below. The scraping works but the problem I have is that it only shows the information from the first URL. How can I fix my code so that it adds the information of the second URL as well? I hope my question is clear.

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']

#df = pd.DataFrame()

dl = []# Storage for data
dt = []# Storage for column names

for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")

    dl_data = soup.find_all("dd") # Scraping the data
    for dlitem in dl_data:
        dl.append(dlitem.text.strip())

    dt_data = soup.find_all("dt") # Scraping the column names
    for dtitem in dt_data:
        dt.append(dtitem.text.strip())


df = pd.DataFrame(dl) # Creating the dataframe

df = df.T # Transposing it because otherwise it is 1D
df.columns = dt # Giving the column names to the dataframe

CodePudding user response:

Avoid the multiple lists and use a leaner approach that stores the data in a more structured way, e.g. a dict. This dict comprehension selects every <dd> that directly follows a <dt>, builds a dict mapping each label to its value, and appends that dict to data. Then simply create a DataFrame from this list of dicts, which gives one row per URL:

data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})

Example

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
data = []

for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")

    # One dict per page: each <dd> directly after a <dt> becomes a label -> value pair
    data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})

pd.DataFrame(data)
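
To see what this produces without hitting the live site, here is a minimal offline sketch of the same idea. The two HTML snippets, their labels and their values are made up for illustration and are not taken from funda.nl; only the selector and the dict comprehension mirror the answer above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-ins for two downloaded listing pages.
pages = [
    "<dl><dt>Asking price</dt><dd>EUR 350,000</dd><dt>Rooms</dt><dd>5</dd></dl>",
    "<dl><dt>Asking price</dt><dd>EUR 425,000</dd><dt>Garden</dt><dd>Back garden</dd></dl>",
]

data = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    # 'dt + dd' matches each <dd> that directly follows a <dt>.
    row = {
        dd.find_previous_sibling("dt").text.strip(): dd.text.strip()
        for dd in soup.select("dt + dd")
    }
    data.append(row)

# One row per page; a label missing on a page becomes NaN in that row.
df = pd.DataFrame(data)
print(df)
```

Because each page becomes its own dict, the pages do not need to share the same set of labels: pandas takes the union of the keys as columns and fills the gaps with NaN.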

CodePudding user response:

It looks like dl and dt do not have an equal number of elements (75 and 71 respectively). Because of this, you can't use dt for the column names: pandas requires exactly one name per column. You could fix that by adding padding (for example, initialising the dt list with zeros) or by removing the unnecessary elements from the dl list.
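
The padding idea can be sketched as below. The list contents and the unnamed_ placeholder names are hypothetical, and this version appends the placeholders at the end rather than initialising the list with zeros. Note that padding only makes the lengths match; it does not guarantee that every label lines up with the right value if the extra elements sit in the middle of the list.

```python
# Hypothetical labels and values; dl has one element more than dt.
dt = ["Asking price", "Rooms"]
dl = ["EUR 350,000", "5", "unlabelled value"]

# Pad the label list with placeholder names until the lengths match.
missing = len(dl) - len(dt)
padded = dt + [f"unnamed_{i}" for i in range(max(missing, 0))]

# Pair labels with values now that both lists have the same length.
row = dict(zip(padded, dl))
print(row)
```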
