DataFrame comes out with only one column

First of all, apologies for my lack of coding knowledge here! Any pointers would be greatly appreciated!

This is the code I have written to read the data from NDBC weather buoy 41049. My plan is to fetch the data (which works), replace the spaces with commas, and then create a DataFrame from the resulting CSV text. string_data prints fine from Python, but the DataFrame comes out with only one column. Why?

import requests
from bs4 import BeautifulSoup
import re
import shlex
import pandas as pd 
import io

file = requests.get("https://www.ndbc.noaa.gov/data/realtime2/41049.spec")
swell_data = BeautifulSoup(file.content, "html.parser")
string_data = swell_data.get_text("swell_data")
re.sub("\string_data ", ",", string_data.strip())
','.join(shlex.split(string_data))
print (string_data)
df = pd.read_csv(io.StringIO(string_data), sep=",")
df

CodePudding user response：

pandas can read directly from a URL, and it can parse whitespace-separated columns with one or more spaces between them. You may also want to skip the first row of the file:

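import pandas as pd

# sep=r"\s+" splits on runs of whitespace; it replaces delim_whitespace=True,
# which is deprecated since pandas 2.2.
data = pd.read_csv(
    "https://www.ndbc.noaa.gov/data/realtime2/41049.spec",
    sep=r"\s+", skiprows=1)
data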

       #yr  mo  dy  hr  mn    m  m.1   sec  m.2 sec.1    - degT      -.1  sec.2  degT.1
0     2021   9  22  20  40  2.8  2.1  12.9  1.8   7.1   NE   NE    SWELL    7.3      37
1     2021   9  22  19  40  2.7  1.7  13.8  2.0   8.3   NE   NE  AVERAGE    6.8      39
2     2021   9  22  18  40  2.8  2.2  13.8  1.7   6.7  NNE  ENE    SWELL    7.0      32
3     2021   9  22  17  40  2.5  1.6  13.8  2.0   6.7   NE  ENE  AVERAGE    6.6      39
4     2021   9  22  16  40  2.4  1.4  13.8  2.0   7.1   NE  ENE  AVERAGE    6.2      34
...    ...  ..  ..  ..  ..  ...  ...   ...  ...   ...  ...  ...      ...    ...     ...
1094  2021   8   8   4  40  1.2  1.1   7.7  0.5   4.8  ENE    E  AVERAGE    5.8      59
1095  2021   8   8   3  40  1.3  1.2   8.3  0.5   4.8  ENE    E  AVERAGE    5.9      75
1096  2021   8   8   2  40  1.3  1.2   7.7  0.5   4.3    E  ENE  AVERAGE    5.9      79
1097  2021   8   8   1  40  1.3  1.2   8.3  0.5   4.3    E    E  AVERAGE    6.1      85
1098  2021   8   8   0  40  1.4  1.3   9.1  0.6   4.8    E    E    SWELL    6.1      91

[1099 rows x 15 columns]
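Note that skiprows=1 drops the line with the column names and uses the units line as the header, which is why the columns above are called m, m.1, sec and so on. If you would rather keep the names, one option is to skip only the units line. This is a minimal sketch on a small inline sample instead of the live URL; the sample rows are taken from the output above, and their exact spacing is an assumption.

```python
import io
import pandas as pd

# Inline sample in the same layout as the .spec file: a names line, a units
# line, then whitespace-separated rows. The spacing here is an assumption.
sample = """\
#YY  MM DD hh mm WVHT  SwH  SwP  WWH  WWP SwD WWD  STEEPNESS  APD MWD
#yr  mo dy hr mn    m    m  sec    m  sec  -  degT     -      sec degT
2021 09 22 20 40  2.8  2.1 12.9  1.8  7.1  NE  NE      SWELL  7.3  37
2021 09 22 19 40  2.7  1.7 13.8  2.0  8.3  NE  NE    AVERAGE  6.8  39
"""

# skiprows=[1] skips only the line at index 1 (the units line), so the first
# line is still used as the header. sep=r"\s+" splits on runs of whitespace.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", skiprows=[1])

# Drop the leading "#" from the first column name.
df = df.rename(columns={"#YY": "YY"})
print(df.columns.tolist())
print(df)
```

The same sep and skiprows arguments should work when you pass the URL instead of io.StringIO(sample).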

CodePudding user response：

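Your version discards the return values of re.sub and ','.join (neither one modifies string_data in place), so read_csv still receives text with no commas in it and puts each whole line into a single column. The code below does a pretty good job. If it were me, I'd want to combine those first 5 columns into a datetime value, but that's an exercise for the reader. You PROBABLY want to delete the first or second line; by default pandas treats only the first line as the header, so the second (units) line ends up as data row 0.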

import requests
import pandas as pd
import io
import re

data = requests.get("https://www.ndbc.noaa.gov/data/realtime2/41049.spec")
swell_data = data.content
# Collapse each run of spaces into one space so sep=" " sees a single delimiter.
string_data = re.sub(" +", " ", swell_data.decode('ascii'))
print(string_data)
df = pd.read_csv(io.StringIO(string_data), sep=" ")
print(df)

Output:


       #YY  MM  DD  hh  mm WVHT  SwH   SwP  WWH  WWP  SwD   WWD STEEPNESS  APD   MWD
0      #yr  mo  dy  hr  mn    m    m   sec    m  sec    -  degT         -  sec  degT
1     2021  09  22  20  40  2.8  2.1  12.9  1.8  7.1   NE    NE     SWELL  7.3    37
2     2021  09  22  19  40  2.7  1.7  13.8  2.0  8.3   NE    NE   AVERAGE  6.8    39
3     2021  09  22  18  40  2.8  2.2  13.8  1.7  6.7  NNE   ENE     SWELL  7.0    32
4     2021  09  22  17  40  2.5  1.6  13.8  2.0  6.7   NE   ENE   AVERAGE  6.6    39
...    ...  ..  ..  ..  ..  ...  ...   ...  ...  ...  ...   ...       ...  ...   ...
1095  2021  08  08  04  40  1.2  1.1   7.7  0.5  4.8  ENE     E   AVERAGE  5.8    59
1096  2021  08  08  03  40  1.3  1.2   8.3  0.5  4.8  ENE     E   AVERAGE  5.9    75
1097  2021  08  08  02  40  1.3  1.2   7.7  0.5  4.3    E   ENE   AVERAGE  5.9    79
1098  2021  08  08  01  40  1.3  1.2   8.3  0.5  4.3    E     E   AVERAGE  6.1    85
1099  2021  08  08  00  40  1.4  1.3   9.1  0.6  4.8    E     E     SWELL  6.1    91

[1100 rows x 15 columns]
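For the datetime exercise mentioned above, here is one possible sketch. It assumes the units line has been skipped so the first five columns are numeric, and it runs on a small inline sample rather than the live file. The timestamp column name is my own choice, not something from the data.

```python
import io
import pandas as pd

# Inline sample with the same first columns as the .spec file (spacing assumed).
sample = """\
#YY  MM DD hh mm WVHT  SwH
#yr  mo dy hr mn    m    m
2021 09 22 20 40  2.8  2.1
2021 08 08 00 40  1.4  1.3
"""

# Keep the names line as header and skip only the units line.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", skiprows=[1])

# pd.to_datetime can assemble timestamps from a frame whose columns are
# named year, month, day, hour, minute, so rename the five parts first.
date_cols = ["#YY", "MM", "DD", "hh", "mm"]
parts = df[date_cols].rename(
    columns={"#YY": "year", "MM": "month", "DD": "day",
             "hh": "hour", "mm": "minute"}
)
df["timestamp"] = pd.to_datetime(parts)

# The five source columns are redundant once they are combined.
df = df.drop(columns=date_cols)
print(df)
```

From there, df.set_index("timestamp") gives you a time-indexed frame if you want to resample or plot.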