How can I split data into training and test sets based on a condition for machine learning models? The test set should contain the same spatial areas (x-y pairs) for every year, and no spatial area should appear in both the training and the test set. For example:
import pandas as pd
data = {'x': [ 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1], 'y': [ 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1], 'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16], 'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0], 'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}
df = pd.DataFrame(data)
df
x y a c year
0 80.1 140.1 1 0.0 2000
1 90.1 150.1 2 0.0 2000
2 0.0 160.1 3 0.0 2000
3 300.1 400.1 4 0.0 2000
4 80.1 140.1 5 0.0 2001
5 90.1 150.1 10 0.0 2001
6 0.0 160.1 11 1.0 2001
7 300.1 400.1 12 0.0 2001
8 80.1 140.1 13 1.0 2002
9 90.1 150.1 14 1.0 2002
10 0.0 160.1 15 0.0 2002
11 300.1 400.1 16 0.0 2002
Expected train dataset:
x y a c year
0 80.1 140.1 1 0.0 2000
1 90.1 150.1 2 0.0 2000
3 300.1 400.1 4 0.0 2000
4 80.1 140.1 5 0.0 2001
5 90.1 150.1 10 0.0 2001
7 300.1 400.1 12 0.0 2001
8 80.1 140.1 13 1.0 2002
9 90.1 150.1 14 1.0 2002
11 300.1 400.1 16 0.0 2002
Expected test dataset:
x y a c year
2 0.0 160.1 3 0.0 2000
6 0.0 160.1 11 1.0 2001
10 0.0 160.1 15 0.0 2002
CodePudding user response:
import numpy as np
import pandas as pd
data = {'x': [ 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1], 'y': [ 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1], 'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16], 'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0], 'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}
df1 = pd.DataFrame(data)
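# Rows 2, 6 and 10 hold the held-out location (x=0, y=160.1), one row per year.
# This relies on the fixed row order of the example data.
test_data = df1.loc[range(2, 12, 4)]
# Drop the test rows by index label; this keeps the original integer dtypes,
# whereas masking with isin() and dropna() would upcast the columns to float.
training_data = df1.drop(test_data.index)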
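The row positions above are hardcoded, so they only work while every year lists the locations in the same order. A minimal sketch that selects the held-out location by its coordinates instead, assuming (as in the question's expected output) that the test location is x=0, y=160.1:

```python
import pandas as pd

data = {'x': [80.1, 90.1, 0, 300.1] * 3,
        'y': [140.1, 150.1, 160.1, 400.1] * 3,
        'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16],
        'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
        'year': [2000] * 4 + [2001] * 4 + [2002] * 4}
df = pd.DataFrame(data)

# Boolean mask that is True for every row of the held-out location,
# regardless of where the row sits in the frame.
is_test = df['x'].eq(0) & df['y'].eq(160.1)

test_data = df[is_test]
training_data = df[~is_test]
```

Because the mask is built from the coordinates, the split stays correct if rows are shuffled or if some years are missing a location.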
CodePudding user response:
You can build a grouping key by marking where the x column equals 0:
m = df.loc[::-1, 'x'].eq(0).cumsum()[::-1]
print(m)
0 3
1 3
2 3
3 2
4 2
5 2
6 2
7 1
8 1
9 1
10 1
11 0
Name: x, dtype: int64
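Then group by this key and filter each group: rows where x is not 0 go to the training set, and rows where x equals 0 go to the test set.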
df_train = df.groupby(m).apply(lambda group: group[group['x'].ne(0)])
x y a c year
x
0 11 300.1 400.1 16 0 2002
1 7 300.1 400.1 12 0 2001
8 80.1 140.1 13 1 2002
9 90.1 150.1 14 1 2002
2 3 300.1 400.1 4 0 2000
4 80.1 140.1 5 0 2001
5 90.1 150.1 10 0 2001
3 0 80.1 140.1 1 0 2000
1 90.1 150.1 2 0 2000
df_test = df.groupby(m).apply(lambda group: group[group['x'].eq(0)])
x y a c year
x
1 10 0.0 160.1 15 0 2002
2 6 0.0 160.1 11 1 2001
3 2 0.0 160.1 3 0 2000
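Both answers rely on knowing in advance which location to hold out. If the held-out locations should instead be chosen at random, one possible approach is to give every unique (x, y) pair an id and split on those ids. This is only a sketch: the 25% test fraction, the seed, and the names loc_id, test_ids and in_test are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

data = {'x': [80.1, 90.1, 0, 300.1] * 3,
        'y': [140.1, 150.1, 160.1, 400.1] * 3,
        'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16],
        'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
        'year': [2000] * 4 + [2001] * 4 + [2002] * 4}
df = pd.DataFrame(data)

# One integer id per unique (x, y) location; rows sharing a location share an id.
loc_id = df.groupby(['x', 'y'], sort=False).ngroup()

# Pick whole locations for the test set (assumed fraction: 25%, at least one).
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
unique_ids = loc_id.unique()
n_test = max(1, round(0.25 * len(unique_ids)))
test_ids = rng.choice(unique_ids, size=n_test, replace=False)

# Every row of a chosen location goes to the test set, all other rows to training.
in_test = loc_id.isin(test_ids)
df_test = df[in_test]
df_train = df[~in_test]
```

Since whole locations are assigned to one side, no (x, y) pair can appear in both sets, and each held-out location contributes its rows for every year.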