I have a dataset with timestamped data, given below:
date, type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
1990-01-15, 'B', 1500
I am trying to split this data into train and test sets while keeping the order based on date. If the split ratio is 0.8 train to test, the expected output is the following:
train_data:
date, type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
test_data:
date, type, price
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-15, 'B', 1500
Is there a pythonic way to do this?
CodePudding user response:
You can do this with groupby and two transforms (cumcount and size):
# group the "type" column by itself, keeping the original row order
g = df.groupby("type", sort=False)["type"]
# per-group sample number (1..group size) and per-group size
sample_nos = g.transform("cumcount").add(1)
group_sizes = g.transform("size")
# a row belongs to the training set if it falls in the first 80% of its group
train_mask = sample_nos <= 0.8 * group_sizes
# select rows accordingly
train_data = df[train_mask].copy()
test_data = df[~train_mask].copy()
train_data
date type price
0 1990-01-01 'A' 100
1 1990-01-02 'A' 200
2 1990-01-03 'A' 300
3 1990-01-04 'A' 400
4 1990-01-05 'A' 500
5 1990-01-06 'A' 600
6 1990-01-07 'A' 700
7 1990-01-08 'A' 800
10 1990-01-11 'B' 1100
11 1990-01-12 'B' 1200
12 1990-01-13 'B' 1300
13 1990-01-14 'B' 1400
and
test_data
date type price
8 1990-01-09 'A' 900
9 1990-01-10 'A' 1000
14 1990-01-15 'B' 1500
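If you need the same split for other ratios or group columns, the mask logic above can be wrapped in a small helper. A minimal sketch, assuming df is the question's DataFrame (the function name split_by_group and its parameters are illustrative, not part of the original answer):
def split_by_group(df, group_col, ratio=0.8):
    # per-group ordered split: first `ratio` of each group goes to train
    g = df.groupby(group_col, sort=False)[group_col]
    sample_nos = g.transform("cumcount").add(1)
    group_sizes = g.transform("size")
    mask = sample_nos <= ratio * group_sizes
    return df[mask].copy(), df[~mask].copy()
# usage
train_data, test_data = split_by_group(df, "type", ratio=0.8)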
CodePudding user response:
You can use the groupby and apply methods to split the data.
Code:
import io
import pandas as pd
# Create sample data as string
s = '''date,type,price
1990-01-01,A,100
1990-01-02,A,200
1990-01-03,A,300
1990-01-04,A,400
1990-01-05,A,500
1990-01-06,A,600
1990-01-07,A,700
1990-01-08,A,800
1990-01-09,A,900
1990-01-10,A,1000
1990-01-11,B,1100
1990-01-12,B,1200
1990-01-13,B,1300
1990-01-14,B,1400
1990-01-15,B,1500'''
# Read the sample
df = pd.read_csv(io.StringIO(s))
# Sort so that rows within each type group are in date order
df = df.sort_values(['type', 'date']).reset_index(drop=True)
# Split df into train and test dataframes: take the first 80% of each group
split_ratio = 0.8
train_data = df.groupby('type', group_keys=False).apply(lambda g: g.head(int(split_ratio * len(g))))
test_data = df[~df.index.isin(train_data.index)]
Output:
# train_data:
          date type  price
0   1990-01-01    A    100
1   1990-01-02    A    200
2   1990-01-03    A    300
3   1990-01-04    A    400
4   1990-01-05    A    500
5   1990-01-06    A    600
6   1990-01-07    A    700
7   1990-01-08    A    800
10  1990-01-11    B   1100
11  1990-01-12    B   1200
12  1990-01-13    B   1300
13  1990-01-14    B   1400
# test_data:
          date type  price
8   1990-01-09    A    900
9   1990-01-10    A   1000
14  1990-01-15    B   1500