I want to tabulate this linked data from the U.S. National Weather Service and store it in a Pandas DataFrame.
Here are the first four lines of the webpage.
Historic Crests
(1) 43.28 ft on 04/17/1979
(2) 39.58 ft on 05/25/1983
(3) 36.67 ft on 02/17/2020
You can access the data from an IDE or notebook using the following code.
import bs4
import urllib.request

link = "https://water.weather.gov/ahps2/crests.php?wfo=jan&gage=jacm6&crest_type=historic"
# str() on the raw bytes keeps the b'...' wrapper and literal "\n" escapes in the text
webpage = str(urllib.request.urlopen(link).read())
soup = bs4.BeautifulSoup(webpage, "html.parser")
print(soup.get_text())
The data is already in a table-like structure, and I think I could tabulate it by parsing it into a dictionary of lists and then loading that into a Pandas DataFrame (a rough sketch of what I mean follows the table below). However, I imagine there is a simpler, more Pythonic approach.
Here's a snippet of the desired table structure.
No. | Crest | Date |
---|---|---|
1 | 43.28 | 04/17/1979 |
2 | 39.58 | 05/25/1983 |
3 | 36.67 | 02/17/2020 |
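For reference, here is the rough dictionary-of-lists version I had in mind, using soup from the snippet above. The regular expression is just my guess at the line format shown earlier, and it scans the whole page text rather than a specific element:

import re
import pandas as pd

cols = {"No.": [], "Crest": [], "Date": []}
for line in soup.get_text().split(r'\n'):  # str(bytes) leaves literal "\n" escapes, so split on those
    m = re.search(r'\((\d+)\)\s+(\d+\.\d+)\s+ft\s+on\s+(\d{2}/\d{2}/\d{4})', line)
    if m:
        cols["No."].append(int(m.group(1)))
        cols["Crest"].append(float(m.group(2)))
        cols["Date"].append(m.group(3))

df = pd.DataFrame(cols)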
CodePudding user response:
You could create a regex pattern and feed it to Series.str.extractall. Each match becomes a row, and because extractall returns a MultiIndex (original index plus match number), reset_index(drop=True) flattens it back to a plain RangeIndex. E.g.:
import pandas as pd

tbl = soup.find('div', class_='water_information')
vals = tbl.get_text().split(r'\n')  # str(bytes) left literal "\n" escapes, so split on those
df = pd.Series(vals).str.extractall(r'\((?P<No>\d+)\)\s(?P<Crest>\d+\.\d+)\sft\son\s(?P<Date>\d{2}/\d{2}/\d{4})')\
       .reset_index(drop=True)
print(df)
    No  Crest        Date
0    1  43.28  04/17/1979
1    2  39.58  05/25/1983
2    3  36.67  02/17/2020
3    4  36.30  03/31/1902
4    5  36.30  12/05/1880
..  ..    ...         ...
86  87  30.30  04/20/1955
87  88  30.26  01/25/1983
88  89  30.26  04/07/1981
89  90  30.25  01/24/2020
90  91  30.23  12/23/1967
And maybe change the dtypes at this point:
df['No'] = df.No.astype(int)
df['Crest'] = df.Crest.astype(float)
df['Date'] = pd.to_datetime(df.Date)
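If you prefer to be explicit about the date format (it is MM/DD/YYYY here) instead of letting pandas infer it, you can pass it to to_datetime:

df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y')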