(Only) some dates in date index get interpreted wrong after import from csv-CodePudding

I want to analyse a dataframe in python. I loaded a csv which consists of two columns, one date/time and one mean value.
I loaded the data like this:

df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()

The problem is, that some dates seem to get interpreted wrong by python. The csv only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format). But when I load the file into python it also shows dates from November and December 2010 in the plot and df.info:

PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mean    61628 non-null  float64
dtypes: float64(1)

I searched in the csv for values from this time, but couldn't find any. Also, the number of entries in the df.info matches the rows of my csv, so I reckon that some dates must have been interpreted wrong.

The tail of my dataframe after the import looks like this:

                  Mean
Timestamp             
2010-12-09 22:45   186
2010-12-09 23:00   206
2010-12-09 23:15   168
2010-12-09 23:30   150
2010-12-09 23:45   132

I searched for similar problems, but could not find an explanation as to why most of the data is interpreted correctly, but some incorrectly. Any idea?

CodePudding user response：

The assumed need for infer_datetime_format=True gives away that you are not passing the correct format. Have a look at the strftime documentation. You are using:

format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10

But the format required is:

format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010

Apparently, infer_datetime_format isn't able to infer each string correctly, taking days as months and vice versa. Indeed, let's reproduce the error:

Create csv:

import pandas as pd
import numpy as np

data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
        'Mean': np.random.randint(0,10,15385)}

df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)

# csv like:

01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3

Load csv incorrectly:

df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])

df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', 
                                 infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()

df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mean    15385 non-null  int64
dtypes: int64(1)
memory usage: 240.4 KB

Load csv correctly:

df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mean    15385 non-null  int64
dtypes: int64(1)
memory usage: 240.4 KB