pd.read_html() not reading date

Time:09-27

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result. The code I've used is as follows:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
tables = pd.read_html(url)

CodePudding user response:

One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# select the correct table; here I select the first one:
tbl = soup.select("table")[0]

# replace each cell's content with the date text, dropping the "(aged XX)" part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0].strip()

# wrap the HTML in StringIO; newer pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(str(tbl)))[0]
print(df)

Prints:

    No. Pos.               Player Date of birth (age)  Caps               Club
0     1   GK      Thomas Sørensen        12 June 1976    14         Sunderland
1     2   MF         Stig Tøfting      14 August 1969    36   Bolton Wanderers
2     3   DF       René Henriksen      27 August 1969    39      Panathinaikos
3     4   DF       Martin Laursen        26 July 1977    15              Milan
4     5   DF      Jan Heintze (c)      17 August 1963    83      PSV Eindhoven
5     6   DF        Thomas Helveg        24 June 1971    67              Milan
6     7   MF      Thomas Gravesen       11 March 1976    22            Everton
7     8   MF      Jesper Grønkjær      12 August 1977    25            Chelsea
8     9   FW    Jon Dahl Tomasson      29 August 1976    38          Feyenoord
9    10   MF     Martin Jørgensen      6 October 1975    32            Udinese
10   11   FW            Ebbe Sand        19 July 1972    44         Schalke 04
11   12   DF        Niclas Jensen      17 August 1974     8    Manchester City
12   13   DF         Steven Lustü       13 April 1971     4                Lyn
13   14   MF         Claus Jensen       29 April 1977    13  Charlton Athletic
14   15   MF       Jan Michaelsen    28 November 1970    11      Panathinaikos
15   16   GK           Peter Kjær     5 November 1965     4           Aberdeen
16   17   MF    Christian Poulsen    28 February 1980     3         Copenhagen
17   18   FW    Peter Løvenkrands     29 January 1980     4            Rangers
18   19   MF     Dennis Rommedahl        22 July 1978    19      PSV Eindhoven
19   20   DF      Kasper Bøgelund      8 October 1980     2      PSV Eindhoven
20   21   FW         Peter Madsen       26 April 1978     4            Brøndby
21   22   GK  Jesper Christiansen       24 April 1978     0              Vejle
22   23   MF  Brian Steen Nielsen    28 December 1968    65           Malmö FF
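
The column printed above still holds plain strings. If real datetime values are needed, it could be converted afterwards. This is only a minimal sketch: it assumes the column name and the "day month year" layout seen in the output above, and it uses three rows copied from that output in place of the full table.

```python
import pandas as pd

# Three rows copied from the printed output above, standing in for the full table.
df = pd.DataFrame(
    {
        "Player": ["Thomas Sørensen", "Stig Tøfting", "Martin Jørgensen"],
        "Date of birth (age)": ["12 June 1976", "14 August 1969", "6 October 1975"],
    }
)

# Strip stray whitespace, then parse "day month-name year" strings into datetimes.
# errors="coerce" turns anything that does not match the format into NaT
# instead of raising an exception.
df["Date of birth (age)"] = pd.to_datetime(
    df["Date of birth (age)"].str.strip(), format="%d %B %Y", errors="coerce"
)

print(df.dtypes)
```

After the conversion, the usual .dt accessors (for example df["Date of birth (age)"].dt.year) are available on the column.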

CodePudding user response:

Try setting the parse_dates parameter to True in the pd.read_html() call.
