What is the criteria for a particular data to be qualifies as time series?-CodePudding

What I am trying to do is trying to detect weather a dataset is time series or not? I want to automate this process.

Let's say I have the below datasets as:

df1:

Heading 1	Heading 2	Heading 1	Heading 2
1/1/2023	34	12	34
2/1/2023	42	99	42
3/1/2023	42	99	42
4/1/2023	42	99	42

df2:

Heading 1	Heading 2	Heading 1	Heading 2
1/1/2023	34	12	34
3/1/2023	42	99	42
4/1/2023	42	99	42
7/1/2023	42	99	42

df3:

Heading 1	Heading 2	Heading 1	Heading 2
Jan 2023	34	12	34
Feb 2023	42	99	42
Mar 2023	42	99	42

df4:

Heading 1	Heading 2	Heading 1	Heading 2
2020	34	12	34
2021	42	99	42
2022	42	99	42

df1 has time column which is evenly spaced, df2 has time column but it is not evenly spaced and df3 and df4 have a time column which is not in the format of datetime

Out of the above df, which one is a time series data and which is not? What exactly is the criteria for a dataset to be considered as time series?

Thanks!

CodePudding user response：

As @GalodoLeste indicates, your dataframes are time series:

df1['Heading 1'] = pd.to_datetime(df1['Heading 1'], dayfirst=True)
df2['Heading 1'] = pd.to_datetime(df2['Heading 1'], dayfirst=True)
df3['Heading 1'] = pd.to_datetime(df3['Heading 1'])
df4['Heading 1'] = pd.to_datetime(df4['Heading 1'], format='%Y')

but third has a frequency and one not:

>>> df1['Heading 1'].dt.freq
'D'

>>> df2['Heading 1'].dt.freq
None

>>> df3['Heading 1'].dt.freq
'MS'

>>> df4['Heading 1'].dt.freq
'AS-JAN'

CodePudding user response：

Let's assume this example:

  Heading 1  Heading 2  Heading 3  Heading 4  Heading 5 Heading 6 Heading 7
0  1/1/2023         34         12         34       2000  Jan 2023  1/1/2023
1  2/1/2023         42         99         42       2001  Feb 2023       NaN
2  3/1/2023         42         99         42       2002  Mar 2023       NaN
3  4/1/2023         42         99         42       2003       NaN       NaN

You can try to convert to_datetime with the default automated detection performed by pandas (that is very efficient!).

def find_datelike_cols(df):
    return df.columns[df.astype(str).apply(pd.to_datetime, errors='coerce').notna().any()]

cols = find_datelike_cols(df)
print(cols)

Output:

Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')

You can also add a minimal number of matching rows as threshold to determine that a column is datetime-like:

def find_datelike_cols(df, thresh=None):
    mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    return df.columns[mask.sum()>=thresh if thresh else mask.any()]

find_datelike_cols(df)
# Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')

find_datelike_cols(df, thresh=3)
# Index(['Heading 1', 'Heading 5', 'Heading 6'], dtype='object')