Extra spaces in values (not trailing spaces, without quote) when reading csv with space delimiters-CodePudding

I am trying to read with pandas the file you find here. I saved in the local directory. I am forced to use Python 3.6

import requests
r = requests.get('https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt')

with open('DWD_weather_stations.txt','w') as fd:
    fd.write(r.text)

First I tried this:

weather_stations = pd.read_csv("DWD_weather_stations.txt",
                                sep="\s ",
                                header=0,  
                                skiprows=[1], 
                                skipinitialspace=True, 
                                engine='python')

The error I got was the following:

ParserError: Error tokenizing data. C error: Expected 8 fields in line 10, saw 9

In the beginning I was not getting what the problem was and I tried a different separator:

weather_stations = pd.read_csv("DWD_weather_stations.txt", sep='\s{2,}',header=[0], skiprows=[1], engine='python')

Now something weird happened: All the columns except from the last one were recognized as a index and the last column was recognized as the only column. I thought back about the previous error and I saw at the 9th line (the last one above) there was an extra single space in the column named Stationsname (the semi last).

Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge Stationsname Bundesland
----------- --------- --------- ------------- --------- --------- ----------------------------------------- ----------
00003 19500401 20110331            202     50.7827    6.0941 Aachen                                   Nordrhein-Westfalen                                                                               
00044 20070401 20211221             44     52.9336    8.2370 Großenkneten                             Niedersachsen                                                                                     
00052 19760101 19880101             46     53.6623   10.1990 Ahrensburg-Wulfsdorf                     Schleswig-Holstein                                                                                
00071 20091201 20191231            759     48.2156    8.9784 Albstadt-Badkap                          Baden-Württemberg                                                                                 
00073 20070401 20211221            340     48.6159   13.0506 Aldersbach-Kriestorf                     Bayern                                                                                            
00078 20041101 20211221             65     52.4853    7.9126 Alfhausen                                Niedersachsen                                                                                     
00091 20040901 20211221            300     50.7446    9.3450 Alsfeld-Eifa                             Hessen                                                                                            
00096 20190409 20211221             50     52.9437   12.8518 Neuruppin-Alt Ruppin                     Brandenburg                                                                                       
00102 20020101 20211221             32     53.8633    8.1275 Leuchtturm Alte Weser                    Niedersachsen

Also other column present the same problem: I tried using the skiprows argument but it doesn't work properly. The number of lines changes after I change the rows to skip, I guess it is not recognizing properly the row ending.

One brutal solution would be to manually modify the rows and convert the wrong white spaces in the values with dashes. But I was wondering if there was a better solution.

CodePudding user response：

I think that there is no easy solution because the file is not consistent with any convention of the csv. If we consider the white spaces as separator the first three columns use only one white space, but this character can be contained into the last two columns. So the white space cannot be used as separator.

The correct separator is the number of characters in each column that is constant. Unfortunately pandas doesn't provide a way to specify this kind of separators. My advice is to download the file as txt, read it and split the columns given the number of character of each column. Then create the csv and read it easily with pandas.

First of all I identified the length of the columns

columns = [5, 9, 9, 15, 12, 11, 41, 41]

Then I transform the txt into csv splitting each line

from itertools import islice

csv_file = []
with open('DWD_weather_stations.txt') as file:
    for i, row in enumerate(file.readlines()):
        if i == 0:
            headers = row.split()
            csv_file.append(";".join(headers))
            continue
        elif i == 1:
            continue
        
        iter_row = iter(row)
        csv_row = ';'.join(''.join(islice(iter_row, None, x)).strip() for x in columns)
        csv_file.append(csv_row)

with open('DWD_weather_stations.csv', "w") as file:
    file.write("\n".join(csv_file))

Several comments:

the first line is the header and it is analyzed separately since as separator can be use easily the white space and the number of characters for each column are not the same of the other rows
The second row is useless for this reason I skipped
The separators used for the CSV are ; for the column and \n for the row