I want to tabulate this linked data from the U.S. National Weather Service and store it in a Pandas DataFrame.
Here are the first four lines of the webpage.
Historic Crests
(1) 43.28 ft on 04/17/1979
(2) 39.58 ft on 05/25/1983
(3) 36.67 ft on 02/17/2020
You can access the data from an IDE or notebook using the following code.
import bs4
import urllib.request

link = "https://water.weather.gov/ahps2/crests.php?wfo=jan&gage=jacm6&crest_type=historic"
# str() on the raw bytes keeps the b'...' wrapper and literal "\n" escapes in the text
webpage = str(urllib.request.urlopen(link).read())
soup = bs4.BeautifulSoup(webpage, "html.parser")
print(soup.get_text())
The data is already in a table-like structure, and I think I could tabulate it by parsing it into a dictionary of lists and then loading that into a Pandas DataFrame (a rough sketch of what I mean follows the table below). However, I imagine there is a simpler, more Pythonic approach.
Here's a snippet of the desired table structure.
No. | Crest | Date |
---|---|---|
1 | 43.28 | 04/17/1979 |
2 | 39.58 | 05/25/1983 |
3 | 36.67 | 02/17/2020 |
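For reference, here is the rough dictionary-of-lists version I had in mind, using soup from the snippet above. The regular expression is just my guess at the line format shown earlier, and it scans the whole page text rather than a specific element:

import re
import pandas as pd

cols = {"No.": [], "Crest": [], "Date": []}
for line in soup.get_text().split(r'\n'):  # str(bytes) leaves literal "\n" escapes, so split on those
    m = re.search(r'\((\d+)\)\s+(\d+\.\d+)\s+ft\s+on\s+(\d{2}/\d{2}/\d{4})', line)
    if m:
        cols["No."].append(int(m.group(1)))
        cols["Crest"].append(float(m.group(2)))
        cols["Date"].append(m.group(3))

df = pd.DataFrame(cols)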
CodePudding user response:
You could create a regex pattern and feed it to Series.str.extractall. Each match becomes a row, and because extractall returns a MultiIndex (original index plus match number), reset_index(drop=True) flattens it back to a plain RangeIndex. E.g.:
import pandas as pd

tbl = soup.find('div', class_='water_information')
vals = tbl.get_text().split(r'\n')  # str(bytes) left literal "\n" escapes, so split on those
df = pd.Series(vals).str.extractall(r'\((?P<No>\d+)\)\s(?P<Crest>\d+\.\d+)\sft\son\s(?P<Date>\d{2}/\d{2}/\d{4})')\
       .reset_index(drop=True)
print(df)
    No  Crest        Date
0    1  43.28  04/17/1979
1    2  39.58  05/25/1983
2    3  36.67  02/17/2020
3    4  36.30  03/31/1902
4    5  36.30  12/05/1880
..  ..    ...         ...
86  87  30.30  04/20/1955
87  88  30.26  01/25/1983
88  89  30.26  04/07/1981
89  90  30.25  01/24/2020
90  91  30.23  12/23/1967
And maybe change the dtypes at this point:
df['No'] = df.No.astype(int)
df['Crest'] = df.Crest.astype(float)
df['Date'] = pd.to_datetime(df.Date)
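If you prefer to be explicit about the date format (it is MM/DD/YYYY here) instead of letting pandas infer it, you can pass it to to_datetime:

df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y')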