What are the criteria for data to qualify as a time series?

Time:01-14

I am trying to detect whether a dataset is a time series or not, and I want to automate this process.

Let's say I have the below datasets as:

df1:

Heading 1 Heading 2 Heading 1 Heading 2
1/1/2023 34 12 34
2/1/2023 42 99 42
3/1/2023 42 99 42
4/1/2023 42 99 42

df2:

Heading 1 Heading 2 Heading 1 Heading 2
1/1/2023 34 12 34
3/1/2023 42 99 42
4/1/2023 42 99 42
7/1/2023 42 99 42

df3:

Heading 1 Heading 2 Heading 1 Heading 2
Jan 2023 34 12 34
Feb 2023 42 99 42
Mar 2023 42 99 42

df4:

Heading 1 Heading 2 Heading 1 Heading 2
2020 34 12 34
2021 42 99 42
2022 42 99 42

df1 has a time column that is evenly spaced. df2 has a time column, but it is not evenly spaced. df3 and df4 have a time column that is not stored in a datetime format.

Which of the above dataframes are time series data and which are not? What exactly are the criteria for a dataset to be considered a time series?

Thanks!

CodePudding user response:

As @GalodoLeste indicates, your dataframes are time series:

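import pandas as pd

# dayfirst=True reads 2/1/2023 as 2 January 2023
df1['Heading 1'] = pd.to_datetime(df1['Heading 1'], dayfirst=True)
df2['Heading 1'] = pd.to_datetime(df2['Heading 1'], dayfirst=True)
df3['Heading 1'] = pd.to_datetime(df3['Heading 1'])
# df4 holds bare years, so the format is given explicitly
df4['Heading 1'] = pd.to_datetime(df4['Heading 1'], format='%Y')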

but three of them have a regular frequency and one does not:

>>> df1['Heading 1'].dt.freq
'D'

>>> df2['Heading 1'].dt.freq
None

>>> df3['Heading 1'].dt.freq
'MS'

>>> df4['Heading 1'].dt.freq
'AS-JAN'
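
To automate the check above, something like the following might work. It is only a sketch: describe_time_column is a hypothetical helper name, and it assumes that "time series with a regular frequency" means every value parses as a date and pd.infer_freq can name a frequency. As far as I know, the exact alias strings depend on the pandas version (newer releases report yearly data as 'YS-JAN' rather than 'AS-JAN').

```python
import pandas as pd

def describe_time_column(values, **to_datetime_kwargs):
    """Hypothetical helper: return (is_datetime_like, inferred_freq)."""
    # Unparseable entries become NaT instead of raising.
    parsed = pd.to_datetime(pd.Series(list(values)), errors="coerce",
                            **to_datetime_kwargs)
    if parsed.empty or parsed.isna().any():
        return False, None
    if len(parsed) < 3:
        # pd.infer_freq needs at least three timestamps.
        return True, None
    # infer_freq returns None when the spacing is irregular.
    return True, pd.infer_freq(pd.DatetimeIndex(parsed))

# Same dates as df1 (evenly spaced) and df2 (gaps between rows).
print(describe_time_column(["1/1/2023", "2/1/2023", "3/1/2023", "4/1/2023"], dayfirst=True))
print(describe_time_column(["1/1/2023", "3/1/2023", "4/1/2023", "7/1/2023"], dayfirst=True))
```

With the df1 dates the second element of the result should be 'D', and with the df2 dates it should be None, which matches the outputs shown above. An irregular column is still reported as datetime-like, so whether it counts as a time series for your purposes is up to you.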

CodePudding user response:

Let's assume this example:

  Heading 1  Heading 2  Heading 3  Heading 4  Heading 5 Heading 6 Heading 7
0  1/1/2023         34         12         34       2000  Jan 2023  1/1/2023
1  2/1/2023         42         99         42       2001  Feb 2023       NaN
2  3/1/2023         42         99         42       2002  Mar 2023       NaN
3  4/1/2023         42         99         42       2003       NaN       NaN

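You can try converting every column with pd.to_datetime and rely on the default format detection performed by pandas (which is very efficient!). Values that cannot be parsed become NaT, so any column with at least one valid date is reported: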

def find_datelike_cols(df):
    return df.columns[df.astype(str).apply(pd.to_datetime, errors='coerce').notna().any()]

cols = find_datelike_cols(df)
print(cols)

Output:

Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')

You can also require a minimum number of matching rows as a threshold before treating a column as datetime-like:

def find_datelike_cols(df, thresh=None):
    mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    return df.columns[mask.sum()>=thresh if thresh else mask.any()]

find_datelike_cols(df)
# Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')

find_datelike_cols(df, thresh=3)
# Index(['Heading 1', 'Heading 5', 'Heading 6'], dtype='object')
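
Once a datetime-like column is found, you may also want to check that it can act as a time index. The sketch below assumes three working criteria that are my own choice, not an official definition: every value parses, timestamps are unique, and they are in increasing order. time_index_report is a hypothetical name.

```python
import pandas as pd

def time_index_report(df, col, **to_datetime_kwargs):
    """Hypothetical helper: basic sanity checks for a candidate time column."""
    parsed = pd.to_datetime(df[col].astype(str), errors="coerce",
                            **to_datetime_kwargs)
    valid = parsed.dropna()
    return {
        "all_parsed": bool(parsed.notna().all()),       # no NaT after parsing
        "unique": bool(valid.is_unique),                # no repeated timestamps
        "sorted": bool(valid.is_monotonic_increasing),  # chronological order
    }

# Same dates as df2: irregular spacing, but still ordered and unique.
df2 = pd.DataFrame({"Heading 1": ["1/1/2023", "3/1/2023", "4/1/2023", "7/1/2023"],
                    "Heading 2": [34, 42, 42, 42]})
print(time_index_report(df2, "Heading 1", dayfirst=True))
```

Under these assumptions df2 passes all three checks even though it has no fixed frequency, so an unevenly spaced dataset can still be treated as a time series.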