I would like to scrape a webpage containing data regarding all trains that have arrived and departed from Amsterdam Central station on a specific date, and convert it into a pandas dataframe.
I know I can scrape the webpage and convert it into a pandas dataframe as shown below, but this doesn't give me the correct table.
import pandas as pd
import requests
url = 'https://www.rijdendetreinen.nl/en/train-archive/2022-01-12/amsterdam-centraal'
response = requests.get(url).content
dfs = pd.read_html(response)
dfs[1]
What I would like to achieve is one pandas dataframe containing all the data under the header "Train services" on the webpage, like so:
Arrival Departure Destination Type Train Platform Composition
02:44 2½ 02:46 2 Rotterdam Centraal Intercity 1409 4a VIRM-4 9416
03:17 5 03:19 4 Utrecht Centraal Intercity 1410 7a ICM-3 4086
03:44 03:46 Rotterdam Centraal Intercity 1413 7a ICM-3 4014
04:17 04:19 Utrecht Centraal Intercity 1414 7a ICM-4 4216
04:44 04:46 Rotterdam Centraal Intercity 1417 7a ICM-3 4086
... ... ... ... ... ... ...
I hope there's someone able to help me with that.
CodePudding user response:
pd.read_html needs a <table> element to populate the dataframe, but the train services on this page are laid out with <div> elements rather than a <table>. I suggest you use BeautifulSoup instead:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rijdendetreinen.nl/en/train-archive/2022-01-12/amsterdam-centraal'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# Column names come from the header row
columns = soup.find('div', class_='row header')
columns = columns.text.strip().split('\n\n')

# Each train service is a div with class "row service"
tables = soup.find_all('div', class_='row service')
data = []
for table in tables:
    # Direct child divs are the cells of this row
    row = table.find_all('div', recursive=False)
    # The last cell holds the composition in the title attribute of its span
    data.append([cell.text.replace('\n', '') for cell in row[:-1]] + [row[-1].find('span')['title']])

df = pd.DataFrame(data, columns=columns)
print(df)
Output
Arrival Departure Destination Type Train Platform Composition
0 02:44 2½ 02:46 2 Rotterdam Centraal Intercity 1409 4a VIRM-4 9416
1 03:17 5 03:19 4 Utrecht Centraal Intercity 1410 7a ICM-3 4086
2 03:44 03:46 Rotterdam Centraal Intercity 1413 7a ICM-3 4014
3 04:17 04:19 Utrecht Centraal Intercity 1414 7a ICM-4 4216
4 04:44 04:46 Rotterdam Centraal Intercity 1417 7a ICM-3 4086
.. ... ... ... ... ... ... ...
784 01:04 01:14 Alkmaar Sprinter 7384 10a SLT-4 2465
785 — 01:19 Utrecht Centraal Intercity 1402 7a VIRM-6 8648
786 01:44 01:46 1 Rotterdam Centraal Intercity 11405 13a ICM-4 4239
787 02:17 02:19 Utrecht Centraal Intercity 1406 11a ICM-4 4247
788 03:17 03:19 1 Utrecht Centraal Intercity 11410 11a ICM-3 4076
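To try the same idea without requesting the live page, here is a minimal offline sketch. The HTML string is a hypothetical, heavily simplified stand-in for the page's markup (only the class names 'row header' and 'row service' are taken from the code above; the real page may be structured differently), and it uses the built-in 'html.parser' so that lxml is not required.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page's structure may differ.
HTML = """
<div class="row header"><div>Arrival</div><div>Departure</div><div>Composition</div></div>
<div class="row service"><div>02:44</div><div>02:46</div><div><span title="VIRM-4 9416"></span></div></div>
<div class="row service"><div>03:17</div><div>03:19</div><div><span title="ICM-3 4086"></span></div></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# There is no <table> element here, so pd.read_html would have nothing to parse.
has_table = soup.find("table") is not None

# Column names: text of the direct child divs of the header row.
header = soup.find("div", class_="row header")
columns = [cell.get_text(strip=True) for cell in header.find_all("div", recursive=False)]

rows = []
for service in soup.find_all("div", class_="row service"):
    cells = service.find_all("div", recursive=False)
    # All cells except the last are read as plain text.
    values = [cell.get_text(strip=True) for cell in cells[:-1]]
    # The last cell keeps its value in the span's title attribute.
    values.append(cells[-1].find("span")["title"])
    rows.append(values)

df = pd.DataFrame(rows, columns=columns)
print(df)
```

Reading the header cells one by one, rather than splitting the header text on blank lines, is a design choice in this sketch: it keeps the column count tied to the number of cells even if the whitespace in the markup changes.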