Home > Software design >  Split pandas dataframe
Split pandas dataframe

Time:11-12

I have a sample dataframe of company A's electronic consumption shown below

year-month company GWh
2017-01 A 100
2018-02 A 110
2019-01 A 90
2019-02 A 105
2020-01 A 117
2020-02 A 120

i would like to remove data of year 2020 and split the remaining dataframe into two sets:

  1. Train dataset contains records before year 2019
  2. Test dataset contains only 2019 records

CodePudding user response:

Coerce them into datetime and select as required. This is most preferable if you'll need to use them for time analysis

df['year-month']=pd.to_datetime(df['year-month']).dt.strftime('%Y-%m')
df1=df[df['year-month'].lt('2019')]
df2=df[df['year-month'].eq('2019')]

Following your comment, I believe thats a bug. I would edit it to

df['year-month']=pd.to_datetime(df['year-month'])
df1=df[df['year-month'].dt.strftime('%Y').lt('2019')]
df2=df[df['year-month'].dt.strftime('%Y').eq('2019')]

CodePudding user response:

Use Series.str.split with Series.astype, Series.eq and Series.lt:

In [358]: df1 = df[df['year-month'].str.split('-').str[0].astype(int).lt(2019)]

In [359]: df2 = df[df['year-month'].str.split('-').str[0].astype(int).eq(2019)]

In [360]: df1
Out[360]: 
  year-month company  GWh
0    2017-01       A  100
1    2018-02       A  110

In [361]: df2
Out[361]: 
  year-month company  GWh
2    2019-01       A   90
3    2019-02       A  105

CodePudding user response:

Since the dataset is a time or date-ordered event log, you can use a time-sensitive cross-validation splitter TimeSeriesSplit()

from sklearn.model_selection import TimeSeriesSplit

ts_cv = TimeSeriesSplit(
    n_splits=1,
    max_train_size=10000, # Here enter the count of records for years before 2019
    test_size=1000, #Here enter the count of records in 2019
)

To manually inspect the split to check that the TimeSeriesSplit worked as we expect.

train, test = list(ts_cv.split(X, y))
X.iloc[test]
X.iloc[train]
  • Related