I have a dataset with timestamped data, given below:
date, type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
1990-01-15, 'B', 1500
I am trying to split this data into train and test sets while keeping the order based on date. If the split ratio is 0.8 train to test, the expected output is the following:
train_data:
date, type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
test_data:
date, type, price
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-15, 'B', 1500
Is there a pythonic way to do this?
CodePudding user response:
You can do this with groupby and two transforms (cumcount and size):
# group the "type" column by itself, keeping the original row order
g = df.groupby("type", sort=False)["type"]
# per-group sample number (1..group size) and per-group size
sample_nos = g.transform("cumcount").add(1)
group_sizes = g.transform("size")
# a row belongs to the training set if it falls in the first 80% of its group
train_mask = sample_nos <= 0.8 * group_sizes
# select rows accordingly
train_data = df[train_mask].copy()
test_data = df[~train_mask].copy()
train_data
date type price
0 1990-01-01 'A' 100
1 1990-01-02 'A' 200
2 1990-01-03 'A' 300
3 1990-01-04 'A' 400
4 1990-01-05 'A' 500
5 1990-01-06 'A' 600
6 1990-01-07 'A' 700
7 1990-01-08 'A' 800
10 1990-01-11 'B' 1100
11 1990-01-12 'B' 1200
12 1990-01-13 'B' 1300
13 1990-01-14 'B' 1400
and
test_data
date type price
8 1990-01-09 'A' 900
9 1990-01-10 'A' 1000
14 1990-01-15 'B' 1500
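If you need the same split for other ratios or group columns, the mask logic above can be wrapped in a small helper. A minimal sketch, assuming df is the question's DataFrame (the function name split_by_group and its parameters are illustrative, not part of the original answer):
def split_by_group(df, group_col, ratio=0.8):
    # per-group ordered split: first `ratio` of each group goes to train
    g = df.groupby(group_col, sort=False)[group_col]
    sample_nos = g.transform("cumcount").add(1)
    group_sizes = g.transform("size")
    mask = sample_nos <= ratio * group_sizes
    return df[mask].copy(), df[~mask].copy()
# usage
train_data, test_data = split_by_group(df, "type", ratio=0.8)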
CodePudding user response:
You can use the groupby and apply methods to split the data.
Code:
import io
import pandas as pd
# Create sample data as string
s = '''date,type,price
1990-01-01,A,100
1990-01-02,A,200
1990-01-03,A,300
1990-01-04,A,400
1990-01-05,A,500
1990-01-06,A,600
1990-01-07,A,700
1990-01-08,A,800
1990-01-09,A,900
1990-01-10,A,1000
1990-01-11,B,1100
1990-01-12,B,1200
1990-01-13,B,1300
1990-01-14,B,1400
1990-01-15,B,1500'''
# Read the sample
df = pd.read_csv(io.StringIO(s))
# Sort so that rows within each type group are in date order
df = df.sort_values(['type', 'date']).reset_index(drop=True)
# Split df into train and test dataframes: take the first 80% of each group
split_ratio = 0.8
train_data = df.groupby('type', group_keys=False).apply(lambda g: g.head(int(split_ratio * len(g))))
test_data = df[~df.index.isin(train_data.index)]
Output:
# train_data:
          date type  price
0   1990-01-01    A    100
1   1990-01-02    A    200
2   1990-01-03    A    300
3   1990-01-04    A    400
4   1990-01-05    A    500
5   1990-01-06    A    600
6   1990-01-07    A    700
7   1990-01-08    A    800
10  1990-01-11    B   1100
11  1990-01-12    B   1200
12  1990-01-13    B   1300
13  1990-01-14    B   1400
# test_data:
          date type  price
8   1990-01-09    A    900
9   1990-01-10    A   1000
14  1990-01-15    B   1500