train_test_split function in C++

Time:09-22

I would like to create a train_test_split function that splits a matrix (vector of vectors) of data into two other matrices, similar to what sklearn's function does. This is my attempt at doing so:

#include <iostream> 
#include <cstdlib>
#include <fstream> 
#include <time.h>
#include <vector>  
#include <string> 

using namespace std;

vector<vector<float>> train_test_split(vector<vector<float>> df, float train_size = 0.8){
  vector<vector<float>> train; 
  vector<vector<float>> test; 
  srand(time(NULL)); 
  for(int i = 0; i < df.size(); i++){
    int x = rand() % 10 + 1; 
    if(x <= train_size * 10){
      train.push_back(df[i]);
    } 
    else{
      test.push_back(df[i]);
    }
  }
  return train, test;
} 

int main(){
   vector<vector<float>> train;
   vector<vector<float>> test; 
   vector<vector<float>> df = {{1,2,3,4}, 
                               {5,6,7,8},
                               {9,10,11,12}};

   train, test = train_test_split(df); 
   cout << "training size: " << train.size() << ", test size: " << test.size() << endl; 
   return 0; 
}

This approach puts data only into the test matrix. After some research, I have discovered that a C++ function cannot return two values. I am very new to C++, and I am wondering what the best way to approach this would be. Any help will be appreciated.

CodePudding user response:

A function can only return one value. But look at your function declaration: it is declared to return a vector<vector<float>>, and that's a container of many vector<float>s. Containers can hold many elements (of the same type), and custom types can have many members:

 struct train_test_split_result {
      vector<vector<float>> train; 
      vector<vector<float>> test; 
 };

 train_test_split_result train_test_split(vector<vector<float>> df, float train_size = 0.8) {
      train_test_split_result result;
      // ...
      // result.train.push_back(...)
      // result.test.push_back(...)
      // ...
      return result;
}

int main(){
   vector<vector<float>> df = {{1,2,3,4}, 
                               {5,6,7,8},
                               {9,10,11,12}};

   train_test_split_result result = train_test_split(df); 
   cout << "training size: " << result.train.size() << ", test size: " << result.test.size() << endl; 
}
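
For completeness, here is one way the elided body could be filled in. Treat it as a sketch under assumptions that are not in the code above: it shuffles row indices with std::mt19937 instead of calling rand() once per row, so the split sizes are exact rather than approximate, and the seed parameter is a hypothetical addition to make runs reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct train_test_split_result {
    std::vector<std::vector<float>> train;
    std::vector<std::vector<float>> test;
};

// Assumes 0 <= train_size <= 1. `seed` is a hypothetical extra parameter
// (not in the original code) so that the split is reproducible.
train_test_split_result train_test_split(const std::vector<std::vector<float>>& df,
                                         float train_size = 0.8f,
                                         unsigned seed = 42) {
    assert(train_size >= 0.0f && train_size <= 1.0f);

    // Shuffle row indices instead of copying and shuffling the rows.
    std::vector<std::size_t> idx(df.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);

    // Number of training rows, truncated toward zero.
    const auto n_train = static_cast<std::size_t>(train_size * df.size());

    train_test_split_result result;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i < n_train) {
            result.train.push_back(df[idx[i]]);
        } else {
            result.test.push_back(df[idx[i]]);
        }
    }
    return result;
}
```

With three rows and the default train_size, this puts two rows in train and one in test. If your compiler supports C++17, you can also unpack the result with structured bindings, auto [train, test] = train_test_split(df);, which is close to what the question was trying to write.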

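PS: You should turn up your compiler's warnings and read them! Then read this: How does the Comma Operator work. In your code, return train, test; evaluates train, discards it, and returns only test. Likewise, train, test = train_test_split(df); assigns only to test, which is why train stays empty.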
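
To make the comma operator concrete, here is a tiny sketch. The function name last_of is made up for illustration. The left operand is evaluated for its side effect, its value is thrown away, and the whole expression yields the right operand.

```cpp
// Hypothetical helper showing what `return a, b;` does.
// `++counter` runs, but its value is discarded; `value` is what gets returned.
int last_of(int& counter, int value) {
    return ++counter, value;
}
```

Calling last_of(c, 7) with c == 0 returns 7 and leaves c == 1. In the question, the left operand (train) has no side effect at all, so it is simply dropped, and that is what the compiler warning would have pointed out.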

PPS: A nested vector is a terrible data structure for a matrix. std::vector benefits a lot from memory locality, but because each inner vector allocates its own storage dynamically, the floats in a std::vector<std::vector<float>> are scattered around in memory. If the size is known at compile time and is small enough not to need dynamic allocation, you can use a nested array. Alternatively, use a flat std::vector<float> to store the matrix.
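
A flat layout could look roughly like the sketch below. The Matrix type and its member names are assumptions made up for this example, not something from a library: all elements live in one contiguous buffer in row-major order, and element (r, c) sits at index r * cols + c.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix backed by a single contiguous buffer.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // holds rows * cols elements

    // Creates a zero-filled matrix of the given shape.
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}

    // Unchecked element access: (r, c) maps to r * cols + c.
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};
```

A train/test split on this layout would then copy whole rows of cols floats at a time instead of pushing back separately allocated inner vectors.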

PPPS: There are also "out parameters": the function takes arguments by non-const reference, the caller passes them in, and the function modifies them. Generally, though, out parameters are not recommended.
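
For comparison, an out-parameter version might look like this sketch. The function split_rows, the Rows alias, and the n_train parameter are made up for illustration, and the split is deliberately deterministic (the first n_train rows go to train) to keep the example short.

```cpp
#include <cstddef>
#include <vector>

using Rows = std::vector<std::vector<float>>;

// Hypothetical out-parameter variant: `train` and `test` are supplied by the
// caller and overwritten here. The first n_train rows go to train, the rest to test.
void split_rows(const Rows& df, std::size_t n_train, Rows& train, Rows& test) {
    train.clear();
    test.clear();
    for (std::size_t i = 0; i < df.size(); ++i) {
        (i < n_train ? train : test).push_back(df[i]);
    }
}
```

The caller has to declare train and test before the call, and nothing at the call site shows that they will be modified, which is part of why returning a struct tends to read better.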
