I have a sample dataframe of company A's electronic consumption shown below
year-month | company | GWh |
---|---|---|
2017-01 | A | 100 |
2018-02 | A | 110 |
2019-01 | A | 90 |
2019-02 | A | 105 |
2020-01 | A | 117 |
2020-02 | A | 120 |
i would like to remove data of year 2020 and split the remaining dataframe into two sets:
- Train dataset contains records before year 2019
- Test dataset contains only 2019 records
CodePudding user response:
Coerce them into datetime and select as required. This is most preferable if you'll need to use them for time analysis
df['year-month']=pd.to_datetime(df['year-month']).dt.strftime('%Y-%m')
df1=df[df['year-month'].lt('2019')]
df2=df[df['year-month'].eq('2019')]
Following your comment, I believe thats a bug. I would edit it to
df['year-month']=pd.to_datetime(df['year-month'])
df1=df[df['year-month'].dt.strftime('%Y').lt('2019')]
df2=df[df['year-month'].dt.strftime('%Y').eq('2019')]
CodePudding user response:
Use Series.str.split
with Series.astype
, Series.eq
and Series.lt
:
In [358]: df1 = df[df['year-month'].str.split('-').str[0].astype(int).lt(2019)]
In [359]: df2 = df[df['year-month'].str.split('-').str[0].astype(int).eq(2019)]
In [360]: df1
Out[360]:
year-month company GWh
0 2017-01 A 100
1 2018-02 A 110
In [361]: df2
Out[361]:
year-month company GWh
2 2019-01 A 90
3 2019-02 A 105
CodePudding user response:
Since the dataset is a time or date-ordered event log, you can use a time-sensitive cross-validation splitter TimeSeriesSplit()
from sklearn.model_selection import TimeSeriesSplit
ts_cv = TimeSeriesSplit(
n_splits=1,
max_train_size=10000, # Here enter the count of records for years before 2019
test_size=1000, #Here enter the count of records in 2019
)
To manually inspect the split to check that the TimeSeriesSplit worked as we expect.
train, test = list(ts_cv.split(X, y))
X.iloc[test]
X.iloc[train]