So I'm designing a CNN in Java and I'm at the point where I really want to parallelize the convolution and pooling. This is my approach (rows, columns, inputLayer, convLayer, poolLayer and features have already been initialized in the constructor):
    int padding = 3;
    int filterSize = 2 * padding + 1;
    // Flatten the 2D input layer into a 1D row-major array for Aparapi
    int[] input = new int[rows * columns];
    for(int r = 0; r < rows; r++)
        System.arraycopy(inputLayer[r], 0, input, r * columns, columns);
    // Flatten the 4 filters into a single 1D array
    int[] filters = new int[4 * filterSize * filterSize];
    for(int fl = 0; fl < 4; fl++)
        for(int fr = 0; fr < filterSize; fr++)
            System.arraycopy(features[fl][fr], 0, filters, fl * filterSize * filterSize + fr * filterSize, filterSize);
    float[] conv = new float[4 * rows * columns];
    float[] pool = new float[rows * columns];
    Range convRange = Range.create3D(columns, rows, 4, 2, 2, 2);
    Kernel convKernel = new Kernel(){
        int h = rows;
        int w = columns;
        int p = padding;
        int fs = filterSize;
        public void run(){
            int val = 0;
            int c = getGlobalId(0);
            int r = getGlobalId(1);
            int l = getGlobalId(2);
            // Clip the filter window at the image borders
            int upper = max(0, p - r);
            int lower = min(fs, h + p - r);
            int left = max(0, p - c);
            int right = min(fs, w + p - c);
            for (int i = upper; i < lower; i++)
                for (int j = left; j < right; j++)
                    val += input[(r + i - p) * w + c + j - p] * filters[l * fs * fs + i * fs + j];
            conv[l * h * w + r * w + c] = Math.round(100.00f * val / fs) / 100.00f;
        }
    };
    convKernel.setExplicit(true);
    convKernel.put(input);
    convKernel.put(conv);
    convKernel.put(filters);
    convKernel.execute(convRange);
    convKernel.get(conv);
    // Copy the flat convolution result back into the 3D convLayer
    for(int convL = 0; convL < 4; convL++)
        for(int convR = 0; convR < rows; convR++)
            System.arraycopy(conv, convL * rows * columns + convR * columns, convLayer[convL][convR], 0, columns);
    Range poolRange = Range.create3D(columns / 2, rows / 2, 4, 2, 2, 2);
    Kernel poolKernel = new Kernel(){
        public void run(){
            int wt = columns;
            int ht = rows;
            float val = 0.00f;
            int c = getGlobalId(0);
            int r = getGlobalId(1);
            int l = getGlobalId(2);
            // 2x2 max pooling over the leaky-ReLU-activated convolution output
            for(int i = 0; i < 2; i++)
                for(int j = 0; j < 2; j++)
                    val = max(val, leakyReLU(conv[l * ht * wt + (2 * r + i) * wt + 2 * c + j]));
            pool[(l * ht * wt / 4) + (r * wt / 2) + c] = Math.round(100.00f * val) / 100.00f;
        }
    };
    poolKernel.setExplicit(true);
    poolKernel.put(conv);
    poolKernel.put(pool);
    poolKernel.execute(poolRange);
    poolKernel.get(pool);
    // Copy the flat pooling result back into the 3D poolLayer
    for(int poolL = 0; poolL < 4; poolL++)
        for(int poolR = 0; poolR < rows / 2; poolR++)
            System.arraycopy(pool, (poolL * rows * columns / 4) + (poolR * columns / 2), poolLayer[poolL][poolR], 0, columns / 2);
Not the prettiest piece of code, but I haven't used Java in ages, let alone Aparapi.
Initially I passed the original 2D arrays directly, but the API warned that it doesn't support them and switched to native mode. Converting everything to 1D arrays is supposed to work, but now I get this message:
    VIII 09, 2022 9:03:02 PM com.aparapi.internal.model.MethodModel init
    WARNING: Method max(FF)F does not contain a LocalVariableTable entry (source not compiled with -g) codegen will attempt to create a synthetic table based on bytecode. This is experimental!!
    VIII 09, 2022 9:03:02 PM com.aparapi.internal.kernel.KernelRunner fallBackToNextDevice
    WARNING: Device failed for NeuralNetwork$2, devices={NVIDIA|Intel|Java Alternative Algorithm|Java Thread Pool}: null
So it looks like poolKernel can't resolve the max function and the whole thing falls back to the CPU.
When debugging, I can confirm that it only uses 12 threads - the number supported by my Intel Core i7. The GPU is an NVIDIA GeForce GTX 1650 with 896 cores, so that's where I would expect the kernels to run.
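For reference, this is roughly how the execution mode can be checked in code rather than in the debugger (a small sketch, assuming the Aparapi build in use still exposes Kernel.getExecutionMode()):

    // Sketch: ask Aparapi where each kernel actually executed after a run.
    // GPU means OpenCL on the graphics card; JTP is the Java thread pool fallback.
    convKernel.execute(convRange);
    System.out.println("convKernel ran in mode: " + convKernel.getExecutionMode());
    poolKernel.execute(poolRange);
    System.out.println("poolKernel ran in mode: " + poolKernel.getExecutionMode());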
Also, at the end it says:
    WARNING: Aparapi is running on an untested OpenCL platform version: OpenCL 3.0 CUDA 11.3.123
    WARNING: Aparapi is running on an untested OpenCL platform version: OpenCL 3.0
What am I missing? P.S.: As you might imagine, I'm new to both conv nets and GPGPU. I know there's a library that contains all the needed CNN functions (cuDNN), but I want to implement it myself to really understand how it works.
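In case it helps anyone reading along: the 2D-to-1D conversion above is plain row-major indexing, i.e. element (r, c) of a rows x columns image lands at index r * columns + c. A minimal self-contained sketch (the helper name flatten2D and the tiny example are mine, just for illustration):

    // Row-major flattening: element (r, c) of a rows x columns array
    // ends up at index r * columns + c in the 1D copy.
    static int[] flatten2D(int[][] src, int rows, int columns) {
        int[] flat = new int[rows * columns];
        for (int r = 0; r < rows; r++)
            System.arraycopy(src[r], 0, flat, r * columns, columns);
        return flat;
    }

    // Example: the 2 x 3 array { {1, 2, 3}, {4, 5, 6} } becomes {1, 2, 3, 4, 5, 6},
    // so (r = 1, c = 2) maps to index 1 * 3 + 2 = 5, i.e. the value 6.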
CodePudding user response:
Figured it out - leakyReLU also uses max with floats, which I totally forgot... I replaced both with if statements. Now the only error message I get is that, according to the API, there are objects passed to the kernel (which is not supported). But I don't see any objects... If someone can help with that part, please chime in.
    int padding = 3;
    int filterSize = 2 * padding + 1;
    int[] input = new int[rows * columns];
    for(int r = 0; r < rows; r++)
        System.arraycopy(inputLayer[r], 0, input, r * columns, columns);
    int[] filters = new int[4 * filterSize * filterSize];
    for(int fl = 0; fl < 4; fl++)
        for(int fr = 0; fr < filterSize; fr++)
            System.arraycopy(features[fl][fr], 0, filters, fl * filterSize * filterSize + fr * filterSize, filterSize);
    float[] conv = new float[4 * rows * columns];
    float[] pool = new float[rows * columns];
    Range convRange = Range.create3D(columns, rows, 4);
    Kernel convKernel = new Kernel(){
        int h = rows;
        int w = columns;
        int p = padding;
        int fs = filterSize;
        public void run(){
            int val = 0;
            int c = getGlobalId(0);
            int r = getGlobalId(1);
            int l = getGlobalId(2);
            // Clip the filter window at the image borders
            int upper = max(0, p - r);
            int lower = min(fs, h + p - r);
            int left = max(0, p - c);
            int right = min(fs, w + p - c);
            for (int i = upper; i < lower; i++)
                for (int j = left; j < right; j++)
                    val += input[(r + i - p) * w + c + j - p] * filters[l * fs * fs + i * fs + j];
            conv[l * h * w + r * w + c] = Math.round(100.00f * val / fs) / 100.00f;
        }
    };
    convKernel.setExplicit(true);
    convKernel.put(input);
    convKernel.put(conv);
    convKernel.put(filters);
    convKernel.execute(convRange);
    convKernel.get(conv);
    for(int convL = 0; convL < 4; convL++)
        for(int convR = 0; convR < rows; convR++)
            System.arraycopy(conv, convL * rows * columns + convR * columns, convLayer[convL][convR], 0, columns);
    Range poolRange = Range.create3D(columns / 2, rows / 2, 4);
    Kernel poolKernel = new Kernel(){
        public void run(){
            int wt = columns;
            int ht = rows;
            float coef = coefficient;
            float val = 0.00f;
            int c = getGlobalId(0);
            int r = getGlobalId(1);
            int l = getGlobalId(2);
            // 2x2 max pooling, with leaky ReLU written out as plain branches
            for(int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++) {
                    float tmp = conv[l * ht * wt + (2 * r + i) * wt + 2 * c + j];
                    if(tmp < 0) tmp = tmp * coef;
                    if (val < tmp) val = tmp;
                }
            pool[(l * ht * wt / 4) + (r * wt / 2) + c] = Math.round(100.00f * val) / 100.00f;
        }
    };
    poolKernel.setExplicit(true);
    poolKernel.put(conv);
    poolKernel.put(pool);
    poolKernel.execute(poolRange);
    poolKernel.get(pool);
    for(int poolL = 0; poolL < 4; poolL++)
        for(int poolR = 0; poolR < rows / 2; poolR++)
            System.arraycopy(pool, (poolL * rows * columns / 4) + (poolR * columns / 2), poolLayer[poolL][poolR], 0, columns / 2);
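For what it's worth, the "objects passed to the kernel" warning above is most likely not about the arrays: rows, columns and coefficient are fields of the enclosing class, so the anonymous Kernel has to capture the outer this (an object) in order to read them. A minimal sketch of one way around it, copying the values into final locals before building the kernel (an illustration of the idea, not the exact code from the post):

    // Sketch: copy outer-class fields into final local primitives first.
    // The anonymous Kernel then captures only primitives and primitive arrays,
    // which Aparapi can hand to OpenCL, instead of the enclosing object.
    final int ht = rows;
    final int wt = columns;
    final float coef = coefficient;
    Kernel poolKernel = new Kernel() {
        @Override
        public void run() {
            int c = getGlobalId(0);
            int r = getGlobalId(1);
            int l = getGlobalId(2);
            float val = 0.00f;
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++) {
                    float tmp = conv[l * ht * wt + (2 * r + i) * wt + 2 * c + j];
                    if (tmp < 0) tmp = tmp * coef;  // leaky ReLU as a branch
                    if (val < tmp) val = tmp;       // running max of the 2x2 window
                }
            pool[(l * ht * wt / 4) + (r * wt / 2) + c] = Math.round(100.00f * val) / 100.00f;
        }
    };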
CodePudding user response:
Well... sometimes, apparently, one needs to write down one's question to be able to answer it. I did some reworking, and now all the errors seem to be gone:
    int padding = 3;
    int filterSize = 2 * padding + 1;
    // Pack the sizes into a primitive array so the kernels never touch
    // fields of the enclosing class
    int[] params = {rows, columns, padding, filterSize};
    int[] input = new int[rows * columns];
    for(int r = 0; r < rows; r++)
        System.arraycopy(inputLayer[r], 0, input, r * columns, columns);
    int[] filters = new int[4 * filterSize * filterSize];
    for(int fl = 0; fl < 4; fl++)
        for(int fr = 0; fr < filterSize; fr++)
            System.arraycopy(features[fl][fr], 0, filters, fl * filterSize * filterSize + fr * filterSize, filterSize);
    float[] conv = new float[4 * rows * columns];
    float[] pool = new float[rows * columns];
    Range convRange = Range.create3D(columns, rows, 4);
    Kernel convKernel = new Kernel(){
        final int h = params[0];
        final int w = params[1];
        final int p = params[2];
        final int fs = params[3];
        public void run(){
            int val = 0;
            final int c = getGlobalId(0);
            final int r = getGlobalId(1);
            final int l = getGlobalId(2);
            // Clip the filter window at the image borders
            final int upper = max(0, p - r);
            final int lower = min(fs, h + p - r);
            final int left = max(0, p - c);
            final int right = min(fs, w + p - c);
            for (int i = upper; i < lower; i++)
                for (int j = left; j < right; j++)
                    val += input[(r + i - p) * w + c + j - p] * filters[l * fs * fs + i * fs + j];
            conv[l * h * w + r * w + c] = Math.round(100.00f * val / fs) / 100.00f;
        }
    };
    convKernel.setExplicit(true);
    convKernel.put(params);
    convKernel.put(input);
    convKernel.put(conv);
    convKernel.put(filters);
    convKernel.execute(convRange);
    convKernel.get(conv);
    for(int convL = 0; convL < 4; convL++)
        for(int convR = 0; convR < rows; convR++)
            System.arraycopy(conv, convL * rows * columns + convR * columns, convLayer[convL][convR], 0, columns);
    Range poolRange = Range.create3D(columns / 2, rows / 2, 4);
    Kernel poolKernel = new Kernel(){
        final int ht = params[0];
        final int wt = params[1];
        public void run(){
            //final float coef = coefficient;
            float val = 0.00f;
            final int c = getGlobalId(0);
            final int r = getGlobalId(1);
            final int l = getGlobalId(2);
            // 2x2 max pooling over the ReLU-activated convolution output
            for(int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++) {
                    float tmp = NeuralNetwork.ReLU(conv[l * ht * wt + (2 * r + i) * wt + 2 * c + j]);
                    if(val < tmp) val = tmp;
                }
            pool[(l * ht * wt / 4) + (r * wt / 2) + c] = Math.round(100.00f * val) / 100.00f;
        }
    };
    poolKernel.setExplicit(true);
    poolKernel.put(params);
    poolKernel.put(conv);
    poolKernel.put(pool);
    poolKernel.execute(poolRange);
    poolKernel.get(pool);
    for(int poolL = 0; poolL < 4; poolL++)
        for(int poolR = 0; poolR < rows / 2; poolR++)
            System.arraycopy(pool, (poolL * rows * columns / 4) + (poolR * columns / 2), poolLayer[poolL][poolR], 0, columns / 2);
Also, I came to the conclusion that I don't need leaky ReLU - regular ReLU is perfectly fine! That being said, I think the topic is more or less closed. I hope someone can learn from my rough path :D
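For completeness, NeuralNetwork.ReLU isn't shown above; something like the following branch-based version (my sketch, not necessarily the exact method) keeps the float max out of the bytecode so Aparapi can translate it:

    // Kernel-friendly ReLU: a plain branch instead of Math.max(0.0f, x),
    // so Aparapi can translate the bytecode to OpenCL without falling back.
    public static float ReLU(float x) {
        if (x < 0.00f) return 0.00f;
        return x;
    }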