Reading Data from URL into a Pandas Dataframe


I have a URL that I am having difficulty reading. It is unusual in the sense that the data is self-generated, i.e. produced from my own query inputs. I have used the same approach with other queries and it works fine, but not in this case:

bst = pd.read_csv('https://psl.noaa.gov/data/correlation/censo.data',
                  skiprows=1, skipfooter=2, index_col=[0], header=None,
                  engine='python',  # the C engine doesn't support skipfooter
                  delim_whitespace=True)

Here is the call with the URL that is causing the problem:

zwnd = pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries',
                    skiprows=1, skipfooter=2, index_col=[0], header=None,
                    engine='python',  # the C engine doesn't support skipfooter
                    delim_whitespace=True)

Thank you for any help that you can provide.

Here is the full error message:

pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
                 engine='python', # c engine doesn't have skipfooter
                 delim_whitespace=True)
Traceback (most recent call last):

  Cell In[240], line 1
    pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:211 in wrapper
    return func(*args, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:331 in wrapper
    return func(*args, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:950 in read_csv
    return _read(filepath_or_buffer, kwds)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:611 in _read
    return parser.read(nrows)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:1778 in read
    ) = self._engine.read(  # type: ignore[attr-defined]

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:282 in read
    alldata = self._rows_to_cols(content)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:1045 in _rows_to_cols
    self._alert_malformed(msg, row_num + 1)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:765 in _alert_malformed
    raise ParserError(msg)

ParserError: Expected 2 fields in line 133, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

CodePudding user response:

Try something like this:

import io
import requests
import pandas as pd

url = "YOUR_URL.csv"
content = requests.get(url).content                        # fetch the raw bytes yourself
df = pd.read_csv(io.StringIO(content.decode('utf-8')))    # then hand the decoded text to pandas
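If the URL returns plain whitespace-delimited text (as the censo.data link in the question does), you can combine this with the original read_csv options. A minimal sketch, reusing the parameters from the question:

import io
import requests
import pandas as pd

url = 'https://psl.noaa.gov/data/correlation/censo.data'
text = requests.get(url).text                      # fetch the page text ourselves
bst = pd.read_csv(io.StringIO(text),
                  skiprows=1, skipfooter=2, index_col=[0], header=None,
                  engine='python',                 # skipfooter needs the python engine
                  delim_whitespace=True)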

CodePudding user response:

pd.read_csv does not parse HTML. You might try pd.read_html, but you would find that it works on table tags, not pre tags. You'll need to install something like requests to get the page content and BeautifulSoup4 to parse the HTML, and then pull out the content of the <pre> tag. Minimal working code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries'
res = requests.get(url)
soup = BeautifulSoup(res.content, "html5lib")
# the data sits inside a <pre> tag; drop the header line and the trailing footer lines
df = pd.DataFrame(soup.find('pre').text.split("\n")[1:-5])

print(df.head(5))

results in

                                                   0
0  1948    0.878    0.779    0.851    0.393    0....
1  1949    0.386    1.197    1.154    1.054    0....
2  1950    0.674    0.973    1.640    0.821    0....
3  1951    1.524    0.698    0.971    0.790    0....
4  1952    1.524    1.510    1.353    0.705    0....
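Each row here is still a single string. A minimal sketch to split it into numeric columns indexed by year, building on the soup object above and assuming the [1:-5] slice leaves only data rows:

import io

rows = soup.find('pre').text.split("\n")[1:-5]      # same slice as above
zwnd = pd.read_csv(io.StringIO("\n".join(rows)),    # re-parse as whitespace-delimited text
                   delim_whitespace=True, header=None, index_col=0)
print(zwnd.head())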

This SO question/answer is a good starting point for what you're trying to accomplish: link

CodePudding user response:

The error message "Expected 2 fields in line 133, saw 3" means that pandas' read_csv hit a line with more columns (3) than it expected (2). This is most likely a formatting problem in what it is being asked to parse, such as an extra delimiter on one of the lines, which causes the columns to be split incorrectly.
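A minimal diagnostic sketch (assuming requests is available; the raw line numbering may be offset from pandas' count by skiprows) to see what is actually on the offending line:

import requests

url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create Timeseries'
lines = requests.get(url).text.splitlines()
# print line 133 plus a little context to see where the extra field comes from
for i, line in enumerate(lines[130:136], start=131):
    print(i, repr(line))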

You may also try copying the data and pasting it in via pd.read_clipboard:

 pd.read_clipboard(sep="\s+", header=None, skiprows=1)