I want to analyse a dataframe in python. I loaded a csv which consists of two columns, one date/time and one mean value.
I loaded the data like this:
df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
The problem is, that some dates seem to get interpreted wrong by python. The csv only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format). But when I load the file into python it also shows dates from November and December 2010 in the plot and df.info:
PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 61628 non-null float64
dtypes: float64(1)
I searched in the csv for values from this time, but couldn't find any. Also, the number of entries in the df.info matches the rows of my csv, so I reckon that some dates must have been interpreted wrong.
The tail of my dataframe after the import looks like this:
Mean
Timestamp
2010-12-09 22:45 186
2010-12-09 23:00 206
2010-12-09 23:15 168
2010-12-09 23:30 150
2010-12-09 23:45 132
I searched for similar problems, but could not find an explanation as to why most of the data is interpreted correctly, but some incorrectly. Any idea?
CodePudding user response:
The assumed need for infer_datetime_format=True
gives away that you are not passing the correct format. Have a look at the strftime documentation
. You are using:
format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10
But the format required is:
format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010
Apparently, infer_datetime_format
isn't able to infer each string correctly, taking days as months and vice versa. Indeed, let's reproduce the error:
Create csv:
import pandas as pd
import numpy as np
data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
'Mean': np.random.randint(0,10,15385)}
df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)
# csv like:
01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3
Load csv incorrectly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M',
infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB
Load csv correctly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB