Split pandas dataframe-CodePudding

I have a sample dataframe of company A's electronic consumption shown below

year-month	company	GWh
2017-01	A	100
2018-02	A	110
2019-01	A	90
2019-02	A	105
2020-01	A	117
2020-02	A	120

i would like to remove data of year 2020 and split the remaining dataframe into two sets:

Train dataset contains records before year 2019
Test dataset contains only 2019 records

CodePudding user response：

Coerce them into datetime and select as required. This is most preferable if you'll need to use them for time analysis

df['year-month']=pd.to_datetime(df['year-month']).dt.strftime('%Y-%m')
df1=df[df['year-month'].lt('2019')]
df2=df[df['year-month'].eq('2019')]

Following your comment, I believe thats a bug. I would edit it to

df['year-month']=pd.to_datetime(df['year-month'])
df1=df[df['year-month'].dt.strftime('%Y').lt('2019')]
df2=df[df['year-month'].dt.strftime('%Y').eq('2019')]

CodePudding user response：

Use Series.str.split with Series.astype, Series.eq and Series.lt:

In [358]: df1 = df[df['year-month'].str.split('-').str[0].astype(int).lt(2019)]

In [359]: df2 = df[df['year-month'].str.split('-').str[0].astype(int).eq(2019)]

In [360]: df1
Out[360]: 
  year-month company  GWh
0    2017-01       A  100
1    2018-02       A  110

In [361]: df2
Out[361]: 
  year-month company  GWh
2    2019-01       A   90
3    2019-02       A  105

CodePudding user response：

Since the dataset is a time or date-ordered event log, you can use a time-sensitive cross-validation splitter TimeSeriesSplit()

from sklearn.model_selection import TimeSeriesSplit

ts_cv = TimeSeriesSplit(
    n_splits=1,
    max_train_size=10000, # Here enter the count of records for years before 2019
    test_size=1000, #Here enter the count of records in 2019
)

To manually inspect the split to check that the TimeSeriesSplit worked as we expect.

train, test = list(ts_cv.split(X, y))
X.iloc[test]
X.iloc[train]