I have a URL that I am having difficulty reading. It is uncommon in the sense that the data is generated on the fly from my own inputs (a CGI query) rather than served as a static file. I have used code like the following on other queries and it works fine, but not in this case:
bst = pd.read_csv('https://psl.noaa.gov/data/correlation/censo.data', skiprows=1,
skipfooter=2,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
delim_whitespace=True)
Here is the code with the URL that is causing the problem:
zwnd = pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries', skiprows=1, skipfooter=2, index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
delim_whitespace=True)
Thank you for any help that you can provide.
Here is the full error message:
pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
delim_whitespace=True)
Traceback (most recent call last):
Cell In[240], line 1
pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:211 in wrapper
return func(*args, **kwargs)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:331 in wrapper
return func(*args, **kwargs)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:950 in read_csv
return _read(filepath_or_buffer, kwds)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:611 in _read
return parser.read(nrows)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:1778 in read
) = self._engine.read( # type: ignore[attr-defined]
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:282 in read
alldata = self._rows_to_cols(content)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:1045 in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:765 in _alert_malformed
raise ParserError(msg)
ParserError: Expected 2 fields in line 133, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
CodePudding user response:
Try something like this:
import io
import requests
import pandas as pd
url = "YOUR_URL.csv"
content = requests.get(url).content
df = pd.read_csv(io.StringIO(content.decode('utf-8')))
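One wrinkle with this particular URL is that the query string contains literal spaces ("Zonal Wind", "Create Timeseries"), which some HTTP clients will reject or mangle. A sketch of building a safely encoded URL with the standard library, using the parameter names and values taken from the URL in the question:

```python
from urllib.parse import urlencode

# Query parameters copied from the URL in the question; urlencode
# escapes the spaces in "Zonal Wind" and "Create Timeseries".
base = "https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl"
params = {
    "ntype": 1, "var": "Zonal Wind", "level": 1000,
    "lat1": 50, "lat2": 25, "lon1": -135, "lon2": -65,
    "iseas": 0, "mon1": 0, "mon2": 0, "iarea": 0,
    "typeout": 1, "Submit": "Create Timeseries",
}
url = base + "?" + urlencode(params)
print(url)  # no raw spaces left in the URL
```

The resulting string can then be passed to requests.get as in the snippet above.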
CodePudding user response:
pd.read_csv does not parse HTML. You might try pd.read_html, but you would find that it works on table tags, not pre tags. You'll need to install something like requests to get the page content, and BeautifulSoup4 to parse the HTML, and then pull out the content of the <pre> tag. Minimal working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries'
res = requests.get(url)
soup = BeautifulSoup(res.content, "html5lib")
df = pd.DataFrame(soup.find('pre').text.split("\n")[1:-5])
print(df.head(5))
results in
0
0 1948 0.878 0.779 0.851 0.393 0....
1 1949 0.386 1.197 1.154 1.054 0....
2 1950 0.674 0.973 1.640 0.821 0....
3 1951 1.524 0.698 0.971 0.790 0....
4 1952 1.524 1.510 1.353 0.705 0....
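Note that each row above is still a single whitespace-separated string. To get numeric monthly columns, the same rows can be re-parsed with read_csv on an in-memory buffer — a sketch using two rows shaped like the output above (only the first three visible values per year are included here):

```python
import io
import pandas as pd

# Two rows shaped like the <pre> contents shown above, joined into
# one text blob and re-parsed on whitespace with the year as index.
raw = "1948  0.878  0.779  0.851\n1949  0.386  1.197  1.154\n"
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, index_col=0)
print(df)
```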
This SO question/answer is a good starting point for what you're trying to accomplish: link
CodePudding user response:
The error message "Expected 2 fields in line 133, saw 3" means that the pandas read_csv function encountered a line with more columns (3) than expected (2) while parsing the data. This is likely caused by a formatting problem in the file, such as an extra delimiter on one of the lines, which makes that line split into the wrong number of columns.
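The mismatch described above can be reproduced with a tiny made-up example: a third line with one extra field triggers the same ParserError that appears in the traceback.

```python
import io
import pandas as pd

# The first two lines have 2 fields; the third has 3, so the python
# engine raises ParserError ("Expected 2 fields ... saw 3").
bad = "1948 0.878\n1949 0.386\n1950 0.674 extra\n"
try:
    pd.read_csv(io.StringIO(bad), sep=r"\s+", header=None, engine="python")
    err = None
except pd.errors.ParserError as e:
    err = str(e)
print(err)
```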
You may also try copying the data to the clipboard and reading it with pd.read_clipboard:
pd.read_clipboard(sep=r"\s+", header=None, skiprows=1)