How can I split train and test data based on some conditions?

How can I split data into training and test sets based on a condition, for use with machine learning models? The test set should contain the same spatial areas (x-y pairs) in every year, and no spatial area should appear in both the training set and the test set. For example:

import pandas as pd
data = {'x': [ 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1], 'y': [ 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1], 'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16], 'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0], 'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}   
df = pd.DataFrame(data)
df
            
             x        y     a    c      year
        
        0   80.1    140.1   1   0.0     2000
        1   90.1    150.1   2   0.0     2000
        2   0.0     160.1   3   0.0     2000
        3   300.1   400.1   4   0.0     2000
        4   80.1    140.1   5   0.0     2001
        5   90.1    150.1   10  0.0     2001
        6   0.0     160.1   11  1.0     2001
        7   300.1   400.1   12  0.0     2001
        8   80.1    140.1   13  1.0     2002
        9   90.1    150.1   14  1.0     2002
        10  0.0     160.1   15  0.0     2002
        11  300.1   400.1   16  0.0     2002

    Expected train dataset:          
                  x       y     a      c     year   
            
            0   80.1    140.1   1     0.0    2000  
            1   90.1    150.1   2     0.0    2000   
             
            3   300.1   400.1   4     0.0    2000  
            4   80.1    140.1   5     0.0    2001  
            5   90.1    150.1   10    0.0    2001  
             
            7   300.1   400.1   12    0.0    2001  
            8   80.1    140.1   13    1.0    2002  
            9   90.1    150.1   14    1.0    2002   
            
            11  300.1   400.1   16    0.0    2002   
    
    Expected test dataset:           
                  x       y     a      c     year   
                           
            2   0.0     160.1   3     0.0    2000 
            
            6   0.0     160.1   11    1.0    2001  
             
            10  0.0     160.1   15    0.0    2002

CodePudding user response:

import numpy as np
import pandas as pd
data = {'x': [ 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1], 'y': [ 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1], 'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16], 'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0], 'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}   

df1 = pd.DataFrame(data)

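# Rows 2, 6 and 10 hold the held-out area (x=0, y=160.1), one row per year
test_data = df1.loc[range(2, 12, 4)]
# Drop the test rows by index label; this keeps the integer dtypes,
# whereas masking with isin() and calling dropna() turns them into floats
training_data = df1.drop(test_data.index)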
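
If the row positions are not known in advance, the held-out rows can also be picked by their coordinates instead. A minimal sketch, assuming the test areas are listed in a small DataFrame (`test_areas` is a hypothetical name, and the single pair in it is taken from the expected output in the question):

```python
import pandas as pd

data = {
    'x': [80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1],
    'y': [140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1,
          140.1, 150.1, 160.1, 400.1],
    'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16],
    'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
    'year': [2000] * 4 + [2001] * 4 + [2002] * 4,
}
df = pd.DataFrame(data)

# Assumption: the (x, y) pairs reserved for testing, one row per area
test_areas = pd.DataFrame({'x': [0.0], 'y': [160.1]})

# A left merge with indicator=True marks each row of df as 'both'
# (its area is in test_areas) or 'left_only' (it is not)
flagged = df.merge(test_areas, on=['x', 'y'], how='left', indicator=True)
is_test = (flagged['_merge'] == 'both').to_numpy()

test_data = df[is_test]
training_data = df[~is_test]
```

Because the selection is done on (x, y) and not on the year, every year of a held-out area ends up in the test set, so no area is shared between the two sets. Matching on floating-point coordinates relies on the values being exactly equal, which holds here because they come from the same literals.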

CodePudding user response:

You can build a grouping key by counting, from the last row upwards, the rows where the x column is 0:

m = df.loc[::-1, 'x'].eq(0).cumsum()[::-1]

print(m)

0     3
1     3
2     3
3     2
4     2
5     2
6     2
7     1
8     1
9     1
10    1
11    0
Name: x, dtype: int64
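
The one-liner above can be unpacked step by step. This is only an illustrative sketch of the same computation on the x column from the question; the intermediate names are made up for readability:

```python
import pandas as pd

x = pd.Series(
    [80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1],
    name='x',
)

reversed_x = x[::-1]          # walk the column from the last row up
is_zero = reversed_x.eq(0)    # True where x == 0
running = is_zero.cumsum()    # how many zero rows have been seen so far
m = running[::-1]             # restore the original row order

print(m.tolist())
# [3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 0]
```

Each row therefore gets the number of x == 0 rows at or below it, so a new group starts right after every zero row.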

Then group by this key and filter the rows within each group:

df_train = df.groupby(m).apply(lambda group: group[group['x'].ne(0)])

          x      y   a  c  year
x
0 11  300.1  400.1  16  0  2002
1 7   300.1  400.1  12  0  2001
  8    80.1  140.1  13  1  2002
  9    90.1  150.1  14  1  2002
2 3   300.1  400.1   4  0  2000
  4    80.1  140.1   5  0  2001
  5    90.1  150.1  10  0  2001
3 0    80.1  140.1   1  0  2000
  1    90.1  150.1   2  0  2000

df_test = df.groupby(m).apply(lambda group: group[group['x'].eq(0)])

        x      y   a  c  year
x
1 10  0.0  160.1  15  0  2002
2 6   0.0  160.1  11  1  2001
3 2   0.0  160.1   3  0  2000
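
For this particular data, a plain boolean mask on x selects the same rows as the grouped filter, without the extra index level. When the held-out areas should not be hard-coded, one option is to label each (x, y) pair with an area id and hold out whole areas. The sketch below is one possible approach, not the method from the answers above; `test_fraction` and the `random_state` value are illustrative assumptions:

```python
import pandas as pd

data = {
    'x': [80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1],
    'y': [140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1,
          140.1, 150.1, 160.1, 400.1],
    'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16],
    'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
    'year': [2000] * 4 + [2001] * 4 + [2002] * 4,
}
df = pd.DataFrame(data)

# Plain mask: same rows as the grouped filter, original index kept
fixed_test = df[df['x'].eq(0)]
fixed_train = df[df['x'].ne(0)]

# Generalisation: give every distinct (x, y) pair an integer area id
area_id = df.groupby(['x', 'y']).ngroup()

# Hold out a fraction of the areas (assumed value: a quarter of them)
test_fraction = 0.25
unique_areas = pd.Series(area_id.unique())
test_ids = unique_areas.sample(frac=test_fraction, random_state=0)

# A row goes to the test set when its whole area was held out
is_test = area_id.isin(test_ids)
df_test = df[is_test]
df_train = df[~is_test]
```

Since the split is made on area ids, all years of an area land on the same side. If scikit-learn is available, `GroupShuffleSplit` from `sklearn.model_selection` can do a similar group-wise split when given `area_id` as the groups.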