I have csv data that looks like this and I'm trying to read it into a pandas df and I've tired all sorts of combinations given the ample documentation online - I've tried things like:
pd.read_csv("https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv", delimiter=',', skiprows=0, low_memory=False)
and I get this error -
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 989
Or, like this but get an empty dataframe:
pd.read_csv('https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv', skiprows=2,
skipfooter=3,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
sep='delimiter')
Out[31]:
Empty DataFrame
Columns: []
Index: []
The first 10 lines of the csv file look like this:
# Water Supply Monthly Volumes for COLUMBIA - THE DALLES DAM (TDAO3)
# Volumes are in KAF
ID,Calendar Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
TDAO3,1948,,,,,,,,,,6866.8,4307.04,4379.38
TDAO3,1949,3546.71,4615.1,8513.31,15020.45,35251.67,21985.99,11226.06,6966.73,4727.37,4406.29,5266.74,5595.91
TDAO3,1950,4353.86,5540.21,9696.27,12854.81,23359.51,39246.78,23393.23,9676.77,5729.74,6990.31,8300.03,8779.57
TDAO3,1951,8032.32,10295.98,7948.59,16144.8,36000.88,28334.09,19735.49,9308.15,6546.95,8907.1,6461.14,6425.76
TDAO3,1952,4671,6222.25,6551.62,18678.3,34866.91,27120.65,15994.18,7907.55,4810.39,3954.32,3259.29,3231.49
TDAO3,1953,7839.72,7870.96,6527.74,9474.66,23384.47,32668.32,17422.63,8655.16,5220.04,5130.46,5183.5,5915.14
TDAO3,1954,5197.51,5967.07,6718.36,10813.69,29190.37,32673.26,29624.38,13456.13,9165.78,5440.92,5732.22,4973.53
thank you,
CodePudding user response:
It is not link directly to file CSV but to page which displays it as HTML using tags <pre>
, <br>
, etc. and this makes problem.
But you can use requests
to download it as text.
Later you can use standard string
-functions to get text between <pre>
and </pre>
and replace <br>
with '\n'
- and you will have text with correct CSV.
And later you can use io.StringIO
to create file in memory - to load it with pd.read_csv()
without saving on disk.
import pandas as pd
import requests
import io
url = "https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv"
response = requests.get(url)
start = response.text.find('<pre>') len('<pre>')
end = response.text.find('</pre>')
pre = response.text[start:end]
text = pre.replace('<br>', '\n')
buf = io.StringIO(text) # file-like object in memory
df = pd.read_csv(buf, skiprows=2, low_memory=False)
print(df.to_string())
Result
ID Calendar Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 TDAO3 1948 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6866.80 4307.04 4379.38
1 TDAO3 1949 3546.71 4615.10 8513.31 15020.45 35251.67 21985.99 11226.06 6966.73 4727.37 4406.29 5266.74 5595.91
2 TDAO3 1950 4353.86 5540.21 9696.27 12854.81 23359.51 39246.78 23393.23 9676.77 5729.74 6990.31 8300.03 8779.57
3 TDAO3 1951 8032.32 10295.98 7948.59 16144.80 36000.88 28334.09 19735.49 9308.15 6546.95 8907.10 6461.14 6425.76
4 TDAO3 1952 4671.00 6222.25 6551.62 18678.30 34866.91 27120.65 15994.18 7907.55 4810.39 3954.32 3259.29 3231.49
5 TDAO3 1953 7839.72 7870.96 6527.74 9474.66 23384.47 32668.32 17422.63 8655.16 5220.04 5130.46 5183.50 5915.14
6 TDAO3 1954 5197.51 5967.07 6718.36 10813.69 29190.37 32673.26 29624.38 13456.13 9165.78 5440.92 5732.22 4973.53
7 TDAO3 1955 4124.26 3570.41 3843.46 7993.82 18505.47 31619.54 20408.54 8922.94 4983.31 5842.70 6982.45 9076.44
8 TDAO3 1956 8079.70 5366.62 8818.69 19754.46 40600.06 40447.34 19846.89 9726.93 5503.69 5446.20 4988.98 6006.80
9 TDAO3 1957 3940.08 4411.33 9155.00 12271.77 40111.86 27864.70 11585.75 6795.70 4613.31 4767.38 4087.55 4789.04
10 TDAO3 1958 4838.12 8246.89 7303.03 13902.66 33958.88 26239.62 12516.52 6898.78 4968.03 5198.19 6662.24 7616.43
... rest ...