First of all, apologies for my lack of coding knowledge! Any pointers would be greatly appreciated.
This is the code I have written to read the data from NDBC weather buoy 41049. My plan is to fetch the data (which works), replace the spaces with commas, and then create a dataframe from the resulting CSV. string_data prints fine in Python, but the dataframe comes out as a single column. Why?
import requests
from bs4 import BeautifulSoup
import re
import shlex
import pandas as pd
import io
file = requests.get("https://www.ndbc.noaa.gov/data/realtime2/41049.spec")
swell_data = BeautifulSoup(file.content, "html.parser")
string_data = swell_data.get_text("swell_data")
re.sub("\string_data ", ",", string_data.strip())
','.join(shlex.split(string_data))
print (string_data)
df = pd.read_csv(io.StringIO(string_data), sep=",")
df
CodePudding user response:
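Pandas can read directly from a URL, and it can parse columns separated by one or more whitespace characters. You may also want to skip the first row of the file, so that the second row (the units) becomes the header. Note that since pandas 2.2, delim_whitespace=True is deprecated in favor of sep=r"\s+":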
import pandas as pd
data = pd.read_csv(
"https://www.ndbc.noaa.gov/data/realtime2/41049.spec",
delim_whitespace=True, skiprows=1)
data
       #yr  mo  dy  hr  mn    m  m.1   sec  m.2  sec.1    -  degT      -.1  sec.2  degT.1
0     2021   9  22  20  40  2.8  2.1  12.9  1.8    7.1   NE    NE    SWELL    7.3      37
1     2021   9  22  19  40  2.7  1.7  13.8  2.0    8.3   NE    NE  AVERAGE    6.8      39
2     2021   9  22  18  40  2.8  2.2  13.8  1.7    6.7  NNE   ENE    SWELL    7.0      32
3     2021   9  22  17  40  2.5  1.6  13.8  2.0    6.7   NE   ENE  AVERAGE    6.6      39
4     2021   9  22  16  40  2.4  1.4  13.8  2.0    7.1   NE   ENE  AVERAGE    6.2      34
...    ...  ..  ..  ..  ..  ...  ...   ...  ...    ...  ...   ...      ...    ...     ...
1094  2021   8   8   4  40  1.2  1.1   7.7  0.5    4.8  ENE     E  AVERAGE    5.8      59
1095  2021   8   8   3  40  1.3  1.2   8.3  0.5    4.8  ENE     E  AVERAGE    5.9      75
1096  2021   8   8   2  40  1.3  1.2   7.7  0.5    4.3    E   ENE  AVERAGE    5.9      79
1097  2021   8   8   1  40  1.3  1.2   8.3  0.5    4.3    E     E  AVERAGE    6.1      85
1098  2021   8   8   0  40  1.4  1.3   9.1  0.6    4.8    E     E    SWELL    6.1      91
[1099 rows x 15 columns]
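As a rough illustration of the difference the separator makes, here is a minimal sketch that runs offline on a few lines laid out like the .spec file. The sample text is an assumption: the values are copied from the output shown in this thread, but the exact spacing of the real file is not. It uses sep=r"\s+" (split on any run of whitespace) and skiprows=[1], which drops only the units line so the names on the first line become the column labels.

```python
import io
import pandas as pd

# A few lines in the layout of the .spec file (values copied from the output
# in this thread; the exact spacing of the real file is an assumption).
sample = """\
#YY  MM DD hh mm WVHT  SwH  SwP  WWH  WWP SwD WWD  STEEPNESS  APD MWD
#yr  mo dy hr mn    m    m  sec    m  sec  -  degT     -      sec degT
2021 09 22 20 40  2.8  2.1 12.9  1.8  7.1  NE  NE      SWELL  7.3  37
2021 09 22 19 40  2.7  1.7 13.8  2.0  8.3  NE  NE    AVERAGE  6.8  39
"""

# With sep="," there are no commas to split on, so each line is one field.
one_col = pd.read_csv(io.StringIO(sample), sep=",")

# sep=r"\s+" splits on any run of whitespace; skiprows=[1] drops only the
# units line, keeping the first line as the header.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", skiprows=[1])

print(one_col.shape)
print(df.shape)
print(df.columns.tolist())
```

The same two keyword arguments should work when the URL is passed in place of io.StringIO(sample), though that has not been checked against the live file here.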
CodePudding user response:
The code below does a pretty good job. If it were me, I'd want to combine those first 5 columns into a datetime value, but that's an exercise for the reader. You PROBABLY want to delete the first or second line; by default pandas treats only the first line as the header, so the units line ends up as data row 0.
import requests
import pandas as pd
import io
import re
data = requests.get("https://www.ndbc.noaa.gov/data/realtime2/41049.spec")
swell_data = data.content
string_data = re.sub(" +", " ", swell_data.decode('ascii'))  # collapse each run of spaces into one
print(string_data)
df = pd.read_csv(io.StringIO(string_data), sep=" ")
print(df)
Output:
       #YY  MM  DD  hh  mm  WVHT  SwH   SwP  WWH  WWP  SwD   WWD  STEEPNESS  APD   MWD
0      #yr  mo  dy  hr  mn     m    m   sec    m  sec    -  degT          -  sec  degT
1     2021  09  22  20  40   2.8  2.1  12.9  1.8  7.1   NE    NE      SWELL  7.3    37
2     2021  09  22  19  40   2.7  1.7  13.8  2.0  8.3   NE    NE    AVERAGE  6.8    39
3     2021  09  22  18  40   2.8  2.2  13.8  1.7  6.7  NNE   ENE      SWELL  7.0    32
4     2021  09  22  17  40   2.5  1.6  13.8  2.0  6.7   NE   ENE    AVERAGE  6.6    39
...    ...  ..  ..  ..  ..   ...  ...   ...  ...  ...  ...   ...        ...  ...   ...
1095  2021  08  08  04  40   1.2  1.1   7.7  0.5  4.8  ENE     E    AVERAGE  5.8    59
1096  2021  08  08  03  40   1.3  1.2   8.3  0.5  4.8  ENE     E    AVERAGE  5.9    75
1097  2021  08  08  02  40   1.3  1.2   7.7  0.5  4.3    E   ENE    AVERAGE  5.9    79
1098  2021  08  08  01  40   1.3  1.2   8.3  0.5  4.3    E     E    AVERAGE  6.1    85
1099  2021  08  08  00  40   1.4  1.3   9.1  0.6  4.8    E     E      SWELL  6.1    91
[1100 rows x 15 columns]
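For the datetime exercise mentioned in this answer, one possible approach is sketched below. It assumes the units line has already been dropped so the five date columns are numeric, and it uses a small hypothetical frame (two rows, with only WVHT kept alongside the date parts) rather than the real download. pd.to_datetime can assemble timestamps from a frame whose columns are named year, month, day, hour and minute, so the columns are renamed first.

```python
import pandas as pd

# Hypothetical two-row frame using the first five column names from the
# output in this answer; the values are copied from the first and last rows.
df = pd.DataFrame({
    "#YY": [2021, 2021],
    "MM": [9, 8],
    "DD": [22, 8],
    "hh": [20, 0],
    "mm": [40, 40],
    "WVHT": [2.8, 1.4],
})

date_cols = ["#YY", "MM", "DD", "hh", "mm"]

# Rename to the component names that pd.to_datetime recognises.
parts = df[date_cols].rename(
    columns={"#YY": "year", "MM": "month", "DD": "day", "hh": "hour", "mm": "minute"}
)

# Build one timestamp per row, then replace the five columns with it.
df["timestamp"] = pd.to_datetime(parts)
df = df.drop(columns=date_cols).set_index("timestamp")

print(df)
```

Setting the timestamp as the index is a design choice that makes time-based slicing and resampling convenient; keeping it as an ordinary column would work just as well.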