I would like to scrape a table from the URLs below. The scraping works but the problem I have is that it only shows the information from the first URL. How can I fix my code so that it adds the information of the second URL as well? I hope my question is clear.
import pandas as pd
import requests
from bs4 import BeautifulSoup
urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
#df = pd.DataFrame()
dl = []  # Storage for data
dt = []  # Storage for column names
for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    dl_data = soup.find_all("dd")  # Scraping the data
    for dlitem in dl_data:
        dl.append(dlitem.text.strip())
    dt_data = soup.find_all("dt")  # Scraping the column names
    for dtitem in dt_data:
        dt.append(dtitem.text.strip())
df = pd.DataFrame(dl)  # Creating the dataframe
df = df.T  # Transposing it because otherwise it is 1D
df.columns = dt  # Giving the column names to the dataframe
CodePudding user response:
Avoid the multiple lists; choose a leaner approach to process your data and store it in a more structured way, e.g. a dict.
This dict comprehension selects every <dd> that directly follows a <dt>, creates a dict with the <dt> text as key and the <dd> text as value, and appends it to data. Simply create a DataFrame from this list of dicts:
data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
data = []
for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    # One dict per URL: each <dt> text becomes a column name, its <dd> the value
    data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})
pd.DataFrame(data)
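To try the idea without hitting the site, here is a minimal offline sketch. The two HTML snippets and their field names are made up for illustration and are not the real funda.nl markup; the point is only to show that one dict per page gives one row per page, and that pandas fills fields missing on a page with NaN.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the HTML of two listing pages (not real funda.nl markup).
pages = [
    "<dl><dt>Asking price</dt><dd>€ 350,000</dd><dt>Status</dt><dd>Available</dd></dl>",
    "<dl><dt>Asking price</dt><dd>€ 425,000</dd><dt>Year of construction</dt><dd>1975</dd></dl>",
]

data = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    # "dt + dd" matches each <dd> whose immediately preceding sibling is a <dt>.
    row = {
        dd.find_previous_sibling("dt").get_text(strip=True): dd.get_text(strip=True)
        for dd in soup.select("dt + dd")
    }
    data.append(row)

# One row per page; columns are the union of all keys seen across the pages.
df = pd.DataFrame(data)
print(df)
```

Because every value is stored under its own label, the rows stay aligned even when the pages do not list the same fields.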
CodePudding user response:
It looks like dl and dt do not have an equal number of elements (75 and 71 respectively). Because of this, you can't use dt for column names. You could fix that by adding padding (for example initialising the dt list with zeros) or by removing unnecessary elements in the dl list.
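The padding idea could be sketched as follows. The helper pad_names and the sample values are hypothetical, and the sketch assumes the extra values sit at the end of the list; padding only equalises the lengths, it does not guarantee that each name lines up with the right value.

```python
def pad_names(names, values, prefix="unnamed"):
    """Return a copy of names extended with placeholders to match len(values)."""
    missing = len(values) - len(names)
    if missing < 0:
        raise ValueError("more names than values; trim the names or check the scrape")
    # Placeholder names are unique so they can safely be used as column labels.
    return list(names) + [f"{prefix}_{i}" for i in range(missing)]


# Made-up sample data: three scraped values but only two column names.
names = ["Asking price", "Status"]
values = ["€ 350,000", "Available", "extra value"]

# Compare the lengths before assigning column names.
if len(names) != len(values):
    names = pad_names(names, values)

record = dict(zip(names, values))
print(record)
```

Checking the two lengths before assigning column names makes the mismatch visible early instead of failing at df.columns = dt.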