What I am trying to do is trying to detect weather a dataset is time series or not? I want to automate this process.
Let's say I have the below datasets as:
df1:
Heading 1 | Heading 2 | Heading 1 | Heading 2 |
---|---|---|---|
1/1/2023 | 34 | 12 | 34 |
2/1/2023 | 42 | 99 | 42 |
3/1/2023 | 42 | 99 | 42 |
4/1/2023 | 42 | 99 | 42 |
df2:
Heading 1 | Heading 2 | Heading 1 | Heading 2 |
---|---|---|---|
1/1/2023 | 34 | 12 | 34 |
3/1/2023 | 42 | 99 | 42 |
4/1/2023 | 42 | 99 | 42 |
7/1/2023 | 42 | 99 | 42 |
df3:
Heading 1 | Heading 2 | Heading 1 | Heading 2 |
---|---|---|---|
Jan 2023 | 34 | 12 | 34 |
Feb 2023 | 42 | 99 | 42 |
Mar 2023 | 42 | 99 | 42 |
df4:
Heading 1 | Heading 2 | Heading 1 | Heading 2 |
---|---|---|---|
2020 | 34 | 12 | 34 |
2021 | 42 | 99 | 42 |
2022 | 42 | 99 | 42 |
df1
has time column which is evenly spaced, df2
has time column but it is not evenly spaced and df3
and df4
have a time column which is not in the format of datetime
Out of the above df
, which one is a time series data and which is not? What exactly is the criteria for a dataset to be considered as time series?
Thanks!
CodePudding user response:
As @GalodoLeste indicates, your dataframes are time series:
df1['Heading 1'] = pd.to_datetime(df1['Heading 1'], dayfirst=True)
df2['Heading 1'] = pd.to_datetime(df2['Heading 1'], dayfirst=True)
df3['Heading 1'] = pd.to_datetime(df3['Heading 1'])
df4['Heading 1'] = pd.to_datetime(df4['Heading 1'], format='%Y')
but third has a frequency and one not:
>>> df1['Heading 1'].dt.freq
'D'
>>> df2['Heading 1'].dt.freq
None
>>> df3['Heading 1'].dt.freq
'MS'
>>> df4['Heading 1'].dt.freq
'AS-JAN'
CodePudding user response:
Let's assume this example:
Heading 1 Heading 2 Heading 3 Heading 4 Heading 5 Heading 6 Heading 7
0 1/1/2023 34 12 34 2000 Jan 2023 1/1/2023
1 2/1/2023 42 99 42 2001 Feb 2023 NaN
2 3/1/2023 42 99 42 2002 Mar 2023 NaN
3 4/1/2023 42 99 42 2003 NaN NaN
You can try to convert to_datetime
with the default automated detection performed by pandas (that is very efficient!).
def find_datelike_cols(df):
return df.columns[df.astype(str).apply(pd.to_datetime, errors='coerce').notna().any()]
cols = find_datelike_cols(df)
print(cols)
Output:
Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')
You can also add a minimal number of matching rows as threshold to determine that a column is datetime-like:
def find_datelike_cols(df, thresh=None):
mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
return df.columns[mask.sum()>=thresh if thresh else mask.any()]
find_datelike_cols(df)
# Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')
find_datelike_cols(df, thresh=3)
# Index(['Heading 1', 'Heading 5', 'Heading 6'], dtype='object')