Convert scraped HTML table to pandas dataframe

Time:01-13

I would like to scrape a webpage containing data regarding all trains that have arrived and departed from Amsterdam Central station on a specific date, and convert it into a pandas dataframe.

I know I can scrape the webpage and convert it into a pandas dataframe like so (see below), but this doesn't give me the correct table.

import pandas as pd
import requests

url = 'https://www.rijdendetreinen.nl/en/train-archive/2022-01-12/amsterdam-centraal'
response = requests.get(url).content

dfs = pd.read_html(response)
dfs[1]

What I would like to achieve is a single pandas dataframe containing all the data under the "Train services" header of the webpage, like so:

Arrival    Departure  Destination          Type        Train   Platform   Composition
02:44  2½  02:46  2   Rotterdam Centraal   Intercity   1409    4a         VIRM-4 9416
03:17  5   03:19  4   Utrecht Centraal     Intercity   1410    7a         ICM-3 4086
03:44      03:46      Rotterdam Centraal   Intercity   1413    7a         ICM-3 4014
04:17      04:19      Utrecht Centraal     Intercity   1414    7a         ICM-4 4216
04:44      04:46      Rotterdam Centraal   Intercity   1417    7a         ICM-3 4086
...        ...        ...                  ...         ...     ...        ...

I hope there's someone able to help me with that.

CodePudding user response:

pd.read_html needs a <table> element to populate the dataframe, and no such element exists for this data in the HTML (the rows are <div> elements). I suggest you use BeautifulSoup instead:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rijdendetreinen.nl/en/train-archive/2022-01-12/amsterdam-centraal'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

columns = soup.find('div', class_='row header')
columns = columns.text.strip().split('\n\n')

tables = soup.find_all('div', class_='row service')
data = []
for table in tables:
    row = table.find_all('div', recursive=False)
    # text of every cell except the last, plus the title of the <span> in the last cell
    data.append([cell.text.replace('\n', '') for cell in row[:-1]] + [row[-1].find('span')['title']])

df = pd.DataFrame(data, columns=columns)
print(df)

Output

       Arrival  Departure          Destination        Type   Train  Platform   Composition
0    02:44  2½   02:46  2   Rotterdam Centraal   Intercity    1409        4a   VIRM-4 9416
1     03:17  5   03:19  4     Utrecht Centraal   Intercity    1410        7a    ICM-3 4086
2        03:44      03:46   Rotterdam Centraal   Intercity    1413        7a    ICM-3 4014
3        04:17      04:19     Utrecht Centraal   Intercity    1414        7a    ICM-4 4216
4        04:44      04:46   Rotterdam Centraal   Intercity    1417        7a    ICM-3 4086
..          ...        ...                  ...         ...     ...       ...          ...
784      01:04      01:14              Alkmaar    Sprinter    7384       10a    SLT-4 2465
785           —     01:19     Utrecht Centraal   Intercity    1402        7a   VIRM-6 8648
786      01:44   01:46  1   Rotterdam Centraal   Intercity   11405       13a    ICM-4 4239
787      02:17      02:19     Utrecht Centraal   Intercity    1406       11a    ICM-4 4247
788      03:17   03:19  1     Utrecht Centraal   Intercity   11410       11a    ICM-3 4076
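
To try the same approach without hitting the site, here is a minimal offline sketch. The HTML fragment is made up: it only assumes the structure implied by the selectors above (a `row header` div, `row service` divs, and a `<span title=...>` in the last cell), so the real page may well differ. It uses the built-in `html.parser` instead of `lxml` to avoid an extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors used in the answer;
# this is NOT copied from the real page.
html = """
<div class="row header">
  <div>Arrival</div>
  <div>Departure</div>
  <div>Train</div>
  <div>Composition</div>
</div>
<div class="row service">
  <div>03:44</div>
  <div>03:46</div>
  <div>1413</div>
  <div><span title="ICM-3 4014">icon</span></div>
</div>
<div class="row service">
  <div>04:17</div>
  <div>04:19</div>
  <div>1414</div>
  <div><span title="ICM-4 4216">icon</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Column names: one per direct child <div> of the header row
header = soup.find("div", class_="row header")
columns = [cell.get_text(strip=True) for cell in header.find_all("div", recursive=False)]

data = []
for service in soup.find_all("div", class_="row service"):
    cells = service.find_all("div", recursive=False)
    # Plain text for every cell except the last one
    values = [cell.get_text(strip=True) for cell in cells[:-1]]
    # The last cell keeps its value in the span's title attribute
    values.append(cells[-1].find("span")["title"])
    data.append(values)

df = pd.DataFrame(data, columns=columns)
print(df)
```

Reading the column names cell by cell (rather than splitting the header text on blank lines) is just one option; it avoids depending on the exact whitespace in the page source.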