I am trying to read the file you find here with pandas. I saved it in the local directory. I am forced to use Python 3.6.
import requests

r = requests.get('https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt')
with open('DWD_weather_stations.txt', 'w') as fd:
    fd.write(r.text)
First I tried this:
import pandas as pd

weather_stations = pd.read_csv("DWD_weather_stations.txt",
                               sep="\s ",
                               header=0,
                               skiprows=[1],
                               skipinitialspace=True,
                               engine='python')
The error I got was the following:
ParserError: Error tokenizing data. C error: Expected 8 fields in line 10, saw 9
At first I did not understand what the problem was, so I tried a different separator:
weather_stations = pd.read_csv("DWD_weather_stations.txt", sep='\s{2,}', header=[0], skiprows=[1], engine='python')
Now something weird happened: all the columns except the last one were recognized as an index, and the last column was recognized as the only column. I thought back to the previous error and saw that in the 9th line (the last one above) there is an extra single space inside the value of the Stationsname column (the second to last).
Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge Stationsname Bundesland
----------- --------- --------- ------------- --------- --------- ----------------------------------------- ----------
00003 19500401 20110331            202     50.7827     6.0941 Aachen                                    Nordrhein-Westfalen
00044 20070401 20211221             44     52.9336     8.2370 Großenkneten                              Niedersachsen
00052 19760101 19880101             46     53.6623    10.1990 Ahrensburg-Wulfsdorf                     Schleswig-Holstein
00071 20091201 20191231            759     48.2156     8.9784 Albstadt-Badkap                          Baden-Württemberg
00073 20070401 20211221            340     48.6159    13.0506 Aldersbach-Kriestorf                     Bayern
00078 20041101 20211221             65     52.4853     7.9126 Alfhausen                                 Niedersachsen
00091 20040901 20211221            300     50.7446     9.3450 Alsfeld-Eifa                              Hessen
00096 20190409 20211221             50     52.9437    12.8518 Neuruppin-Alt Ruppin                     Brandenburg
00102 20020101 20211221             32     53.8633     8.1275 Leuchtturm Alte Weser                    Niedersachsen
Other columns present the same problem. I also tried the skiprows argument, but it doesn't work properly: the number of lines changes after I change the rows to skip, so I guess the row endings are not being recognized correctly.
One brute-force solution would be to manually edit the rows and replace the offending white spaces inside the values with dashes, but I was wondering whether there is a better solution.
CodePudding user response:
I think there is no easy way to parse this file with read_csv alone, because it does not follow any CSV convention. If we take white space as the separator, the first columns are separated by a single space, but that same character can also appear inside the values of the last two columns, so white space cannot be used as a separator.
The real delimiter is position: each column occupies a constant number of characters, i.e. this is a fixed-width file (pandas can also parse such files directly with pandas.read_fwf). One way to handle it is to download the file as txt, read it, and split each line at the character count of each column; then create a CSV and read it easily with pandas.
First of all, I identified the length of each column:
columns = [5, 9, 9, 15, 12, 11, 41, 41]
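In fact, pandas can read fixed-width files directly with pandas.read_fwf, which accepts exactly these widths, so the manual conversion below is optional. A minimal sketch, demonstrated on an inline two-row sample standing in for the data lines of DWD_weather_stations.txt (with the real file you would pass the path plus skiprows=[0, 1] and the same names, since the header row is not fixed-width):

```python
import pandas as pd
from io import StringIO

widths = [5, 9, 9, 15, 12, 11, 41, 41]
names = ["Stations_id", "von_datum", "bis_datum", "Stationshoehe",
         "geoBreite", "geoLaenge", "Stationsname", "Bundesland"]

# Build a two-row sample padded to the real column widths.
rows = [
    ("00003", "19500401", "20110331", "202", "50.7827", "6.0941",
     "Aachen", "Nordrhein-Westfalen"),
    ("00102", "20020101", "20211221", "32", "53.8633", "8.1275",
     "Leuchtturm Alte Weser", "Niedersachsen"),
]
sample = "\n".join(
    "".join(value.rjust(width) for value, width in zip(row, widths))
    for row in rows
)

# read_fwf cuts each line at fixed character positions and strips the padding,
# so spaces inside Stationsname are no longer a problem.
weather_stations = pd.read_fwf(StringIO(sample), widths=widths, names=names)
print(weather_stations["Stationsname"].tolist())
```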
Then I transformed the txt into a CSV, splitting each line:
from itertools import islice

csv_file = []
with open('DWD_weather_stations.txt') as file:
    for i, row in enumerate(file.readlines()):
        if i == 0:
            headers = row.split()
            csv_file.append(";".join(headers))
            continue
        elif i == 1:
            continue
        iter_row = iter(row)
        csv_row = ';'.join(''.join(islice(iter_row, None, x)).strip() for x in columns)
        csv_file.append(csv_row)

with open('DWD_weather_stations.csv', "w") as file:
    file.write("\n".join(csv_file))
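The join expression works because iter_row is shared across the islice calls: each islice(iter_row, None, x) consumes the next x characters of the row, so the column widths are taken one after the other. A tiny self-contained illustration of the trick:

```python
from itertools import islice

it = iter("abcdefgh")
# Each islice call resumes where the previous one stopped,
# so the string is cut into chunks of 3, 2 and 3 characters.
chunks = [''.join(islice(it, None, n)) for n in (3, 2, 3)]
print(chunks)
# ['abc', 'de', 'fgh']
```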
Several comments:
- The first line is the header; it is handled separately because white space can safely be used as its separator, and the number of characters per column is not the same as in the data rows.
- The second row (the dashed line) is useless, so I skip it.
- The separators used for the CSV are ';' for the columns and '\n' for the rows.
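Once DWD_weather_stations.csv has been written, it can be loaded with read_csv and the default C engine, since ';' never occurs inside the values. A sketch, shown here on an inline sample standing in for the generated file:

```python
import pandas as pd
from io import StringIO

# Stand-in for the generated DWD_weather_stations.csv.
csv_text = (
    "Stations_id;von_datum;bis_datum;Stationshoehe;"
    "geoBreite;geoLaenge;Stationsname;Bundesland\n"
    "00003;19500401;20110331;202;50.7827;6.0941;Aachen;Nordrhein-Westfalen\n"
    "00102;20020101;20211221;32;53.8633;8.1275;Leuchtturm Alte Weser;Niedersachsen\n"
)
weather_stations = pd.read_csv(StringIO(csv_text), sep=";")
print(weather_stations.shape)
# (2, 8)
```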