Home > front end >  (Only) some dates in date index get interpreted wrong after import from csv
(Only) some dates in date index get interpreted wrong after import from csv

Time:10-23

I want to analyse a dataframe in python. I loaded a csv which consists of two columns, one date/time and one mean value.
I loaded the data like this:

df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()

The problem is, that some dates seem to get interpreted wrong by python. The csv only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format). But when I load the file into python it also shows dates from November and December 2010 in the plot and df.info:

PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mean    61628 non-null  float64
dtypes: float64(1)

I searched in the csv for values from this time, but couldn't find any. Also, the number of entries in the df.info matches the rows of my csv, so I reckon that some dates must have been interpreted wrong.

The tail of my dataframe after the import looks like this:

                  Mean
Timestamp             
2010-12-09 22:45   186
2010-12-09 23:00   206
2010-12-09 23:15   168
2010-12-09 23:30   150
2010-12-09 23:45   132

I searched for similar problems, but could not find an explanation as to why most of the data is interpreted correctly, but some incorrectly. Any idea?

CodePudding user response:

The assumed need for infer_datetime_format=True gives away that you are not passing the correct format. Have a look at the strftime documentation. You are using:

format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10

But the format required is:

format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010

Apparently, infer_datetime_format isn't able to infer each string correctly, taking days as months and vice versa. Indeed, let's reproduce the error:

Create csv:

import pandas as pd
import numpy as np

data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
        'Mean': np.random.randint(0,10,15385)}

df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)

# csv like:

01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3

Load csv incorrectly:

df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])

df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', 
                                 infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()

df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mean    15385 non-null  int64
dtypes: int64(1)
memory usage: 240.4 KB

Load csv correctly:

df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mean    15385 non-null  int64
dtypes: int64(1)
memory usage: 240.4 KB
  • Related