Creating a ML algorithm where the train data does not have same number of columns in all records


So I have the following train data (no header, explanation below):

[1.3264,1.3264,1.3263,1.32632]
[2.32598,2.3256,2.3257,2.326,2.3256,2.3257,2.32566]
[10.3215,10.3215,10.3214,10.3214,10.3214,10.32124]

It does not have a header because, on each row, all elements except the last one are inputs and the last one is the result/output.

So taking the first example: 1.3264, 1.3264, 1.3263 are the inputs/feed data that I want to give to the algorithm, and 1.32632 is the outcome/result.

All of these are historical values that should lead to pattern recognition. I would like to give some test data to the algorithm, and it would give me an outcome/result based on the pattern it identified.
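In code terms, each row splits like this:

```python
# Each row splits into inputs (all but the last value) and the outcome (last value)
row = [1.3264, 1.3264, 1.3263, 1.32632]
inputs, outcome = row[:-1], row[-1]
print(inputs)   # [1.3264, 1.3264, 1.3263]
print(outcome)  # 1.32632
```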

From all the examples I have looked into with ML and sklearn, I have never seen one where, for the same type of data, records have different numbers of entries. They all seem to have the same number of columns and different types of inputs, whereas mine is always the same type of input.

CodePudding user response:

You can try two different approaches:

  1. Extract features from your variable-length data so that every record has a fixed number of features. After that you can use any algorithm from sklearn or other packages. Feature extraction is a highly domain-specific process that requires context about what the data actually is. For example, you can try features like these:
import numpy as np


def extract_features_one_row(arr):
    y = arr[-1]  # the last element is the target
    arr = np.array(arr[:-1])  # the remaining elements are the inputs

    features = [
        np.mean(arr),
        np.sum(arr),
        np.median(arr),
        np.std(arr),
        np.percentile(arr, 5),
        np.percentile(arr, 95),
        np.percentile(arr, 25),
        np.percentile(arr, 75),
        (arr[1:] > arr[:-1]).sum(),  # number of increasing pairs
        (arr > arr.mean()).sum(),  # number of elements > mean value
        # extract trends, number of modes, etc
    ]
    return features, y


data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]

X, y = zip(*[extract_features_one_row(row) for row in data])
X = np.array(X)  # (3, 10)
print(X.shape, y)

So now X has the same number of columns in every row.
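Once the features are fixed-size, any sklearn estimator can be trained on them. A minimal sketch (using `RandomForestRegressor` and a few simple summary features, both chosen here purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def simple_features(values):
    # A few fixed-size summary features of a variable-length input
    a = np.array(values)
    return [a.mean(), a.std(), a[-1]]


data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]

# Last element of each row is the target; the rest are the inputs
X = np.array([simple_features(row[:-1]) for row in data])
y = np.array([row[-1] for row in data])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Predict the outcome for a new variable-length sequence
test_seq = [5.12, 5.121, 5.1205]
pred = model.predict(np.array([simple_features(test_seq)]))
print(pred)
```

With real data you would use a richer feature set (like the one above) and far more rows than three.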

  2. Use an ML algorithm that supports variable-length data: recurrent neural networks, transformers, or convolutional networks with padding.
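The padding idea can be sketched with plain NumPy: left-pad every sequence with zeros up to the length of the longest one, so the result is a rectangular array that such models can consume (the zero padding value and left-side padding are assumptions; frameworks usually also take a mask to ignore the padded positions):

```python
import numpy as np

sequences = [
    [1.3264, 1.3264, 1.3263],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257],
]
max_len = max(len(s) for s in sequences)

# Left-pad each sequence with zeros so all rows share the same length
padded = np.array([[0.0] * (max_len - len(s)) + list(s) for s in sequences])
print(padded.shape)  # (2, 6)
```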