Custom trained Object Detection model, can't read predicted TensorflowJS tensors array elements


I succeeded in training an object detection TensorFlow SavedModel on my custom data and converting it to TensorFlow.js, using this post https://blog.tensorflow.org/2021/01/custom-object-detection-in-browser.html , this notebook https://colab.research.google.com/drive/1MdzgmdYJk947sXyls45V7auMPttHelBZ?usp=sharing , and this ReactJS app template https://github.com/hugozanini/TFJS-object-detection . However, I cannot figure out which elements of the TensorFlow.js model's predicted tensor array to use to draw bounding boxes on the video window in the React app. The issue is described below.


I can't interpret the array output of the TensorFlow.js model, so I cannot use it to display results.

In the prediction section of the function below, how can I tell which prediction tensor array element holds the boxes, scores, or classes?

renderPredictions = predictions => {
    const ctx = this.canvasRef.current.getContext("2d");
    ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);

    // Font options.
    const font = "16px sans-serif";
    ctx.font = font;
    ctx.textBaseline = "top";

    //Getting predictions
    //HERE how to know which prediction element is for boxes, scores or classes?
    const boxes = predictions[4].arraySync();
    const scores = predictions[5].arraySync();
    const classes = predictions[6].dataSync();
    const detections = this.buildDetectedObjects(scores, threshold,
                                    boxes, classes, classesDir);

    detections.forEach(item => {
      const x = item['bbox'][0];
      const y = item['bbox'][1];
      const width = item['bbox'][2];
      const height = item['bbox'][3];

      // Draw the bounding box.
      ctx.strokeStyle = "#00FFFF";
      ctx.lineWidth = 4;
      ctx.strokeRect(x, y, width, height);

      // Draw the label background.
      ctx.fillStyle = "#00FFFF";
      const textWidth = ctx.measureText(item["label"] + " " + (100 * item["score"]).toFixed(2) + "%").width;
      const textHeight = parseInt(font, 10); // base 10
      ctx.fillRect(x, y, textWidth + 4, textHeight + 4);
    });

Output of the predictions variable, with all root elements and their child elements expanded:

predictions.forEach(i=>{console.log(i.arraySync())})

Element 0:
[100]
0: 100
length: 1
[[Prototype]]: Array(0)

Element 1:
[Array(100)]
0: Array(100)
0: 1878
1: 1851
2: 1021
3: 1520
4: 1015
5: 1214
6: 1576
7: 973
.
.
.
99: 164
length: 100
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Element 2:
[Array(100)]
0: Array(100)
0: 0.32437506318092346
1: 0.301730751991272
2: 0.29111218452453613
3: 0.28852424025535583
4: 0.270997017621994
5: 0.2643387019634247
6: 0.26040562987327576
7: 0.26039886474609375
8: 0.25926679372787476
.
.
.
99: 0.07848591357469559
length: 100
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Element 3:
[Array(1917)]
0: Array(1917)
[0 … 1916] (values collapsed)
length: 1917
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Element 4:
[Array(1917)]
0: Array(1917)
[0 … 1916] (values collapsed)
length: 1917
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Element 5:
[Array(100)]
0: Array(100)
0: (4) [0.6603810787200928, 0.05048862099647522, 1, 0.9434524774551392]
1: (4) [0.23983880877494812, 0, 0.8556625843048096, 0.4294479489326477]
2: (4) [0.8396192789077759, 0.7738925218582153, 1, 1]
3: (4) [0.5717893242835999, 0, 0.9534353613853455, 0.4774531126022339]
4: (4) [0.8307931423187256, 0.6522707939147949, 1, 0.9869933128356934]
5: (4) [0.05821444094181061, 0, 0.4731786847114563, 0.372313916683197]
6: (4) [0.7183740139007568, 0, 0.9927113056182861, 0.49393266439437866]
7: (4) [0.8234801888465881, 0, 1, 0.22610855102539062]
8: (4) [0.8305549621582031, 0.022644445300102234, 1, 0.363572359085083]
9: (4) [0.830806314945221, 0.12703362107276917, 1, 0.46965447068214417]
10: (4) [0.8307640552520752, 0.5480412244796753, 1, 0.8905272483825684]
11: (4) [0.8307292461395264, 0.2323085516691208, 1, 0.5748147368431091]
12: (4) [0.8307291269302368, 0.44283488392829895, 1, 0.7853409051895142]
.
.
.
99: (4) [0.02437518537044525, 0.7758948802947998, 0.2996327877044678, 0.9475241899490356]
length: 100
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Element 6:
[Array(100)]
0: Array(100)
0: (2) [0.008746352046728134, 0.32437506318092346]
1: (2) [0.008580721914768219, 0.301730751991272]
2: (2) [0.0071436576545238495, 0.29111218452453613]
3: (2) [0.00598490796983242, 0.28852424025535583]
4: (2) [0.0071937208995223045, 0.270997017621994]
5: (2) [0.005746915470808744, 0.2643387019634247]
6: (2) [0.005973200313746929, 0.26040562987327576]
7: (2) [0.007103783078491688, 0.26039886474609375]
8: (2) [0.007148009724915028, 0.25926679372787476]
.
.
.
99: (2) [0.005199342034757137, 0.07848591357469559]
length: 100
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Element 7:
[Array(100)]
0: Array(100)
0: 1
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
7: 1
8: 1
9: 1
.
.
.
99: 1
length: 100
[[Prototype]]: Array(0)
length: 1
[[Prototype]]: Array(0)

Full JS code below:

import React from "react";
import ReactDOM from "react-dom";
import * as tf from '@tensorflow/tfjs';
import {loadGraphModel} from '@tensorflow/tfjs-converter';
import "./styles.css";
tf.setBackend('webgl');

const threshold = 0.75;

async function load_model() {
    // It's possible to load the model locally or from a repo
    // You can choose whatever IP and PORT you want in the "http://127.0.0.1:8080/model.json" just set it before in your https server
    const model = await loadGraphModel("http://127.0.0.1:8080/model.json");
    //const model = await loadGraphModel("https://raw.githubusercontent.com/hugozanini/TFJS-object-detection/master/models/kangaroo-detector/model.json");
    return model;
  }

let classesDir = {
    0: {
        name: 'connection',
        id: 0
    }
}

class App extends React.Component {
  videoRef = React.createRef();
  canvasRef = React.createRef();


  componentDidMount() {
    if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
      const webCamPromise = navigator.mediaDevices
        .getUserMedia({
          audio: false,
          video: {
            facingMode: "user"
          }
        })
        .then(stream => {
          window.stream = stream;
          this.videoRef.current.srcObject = stream;
          return new Promise((resolve, reject) => {
            this.videoRef.current.onloadedmetadata = () => {
              resolve();
            };
          });
        });

      const modelPromise = load_model();

      Promise.all([modelPromise, webCamPromise])
        .then(values => {
          this.detectFrame(this.videoRef.current, values[0]);
        })
        .catch(error => {
          console.error(error);
        });
    }
  }

    detectFrame = (video, model) => {
        tf.engine().startScope();
        model.executeAsync(this.process_input(video)).then(predictions => {
        this.renderPredictions(predictions, video);
        requestAnimationFrame(() => {
          this.detectFrame(video, model);
        });
        tf.engine().endScope();
      });
  };

  process_input(video_frame){
    const tfimg = tf.browser.fromPixels(video_frame).toInt();
    const expandedimg = tfimg.transpose([0,1,2]).expandDims();
    return expandedimg;
  };

  buildDetectedObjects(scores, threshold, boxes, classes, classesDir) {
    const detectionObjects = []
    var video_frame = document.getElementById('frame');

    scores.forEach((score, i) => {
      if (score > threshold) {
        const bbox = [];
        const minY = boxes[0][i][0] * video_frame.offsetHeight;
        const minX = boxes[0][i][1] * video_frame.offsetWidth;
        const maxY = boxes[0][i][2] * video_frame.offsetHeight;
        const maxX = boxes[0][i][3] * video_frame.offsetWidth;
        bbox[0] = minX;
        bbox[1] = minY;
        bbox[2] = maxX - minX;
        bbox[3] = maxY - minY;
        detectionObjects.push({
          class: classes[i],
          label: classesDir[classes[i]].name,
          score: score.toFixed(4),
          bbox: bbox
        })
      }
    })
    return detectionObjects
  }

  renderPredictions = predictions => {
    const ctx = this.canvasRef.current.getContext("2d");
    ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);

    // Font options.
    const font = "16px sans-serif";
    ctx.font = font;
    ctx.textBaseline = "top";

    //Getting predictions
    // HERE how to know which prediction element is for boxes, scores or classes?
    const boxes = predictions[4].arraySync();
    const scores = predictions[5].arraySync();
    const classes = predictions[6].dataSync();
    const detections = this.buildDetectedObjects(scores, threshold,
                                    boxes, classes, classesDir);

    detections.forEach(item => {
      const x = item['bbox'][0];
      const y = item['bbox'][1];
      const width = item['bbox'][2];
      const height = item['bbox'][3];

      // Draw the bounding box.
      ctx.strokeStyle = "#00FFFF";
      ctx.lineWidth = 4;
      ctx.strokeRect(x, y, width, height);

      // Draw the label background.
      ctx.fillStyle = "#00FFFF";
      const textWidth = ctx.measureText(item["label"] + " " + (100 * item["score"]).toFixed(2) + "%").width;
      const textHeight = parseInt(font, 10); // base 10
      ctx.fillRect(x, y, textWidth + 4, textHeight + 4);
    });

    detections.forEach(item => {
      const x = item['bbox'][0];
      const y = item['bbox'][1];

      // Draw the text last to ensure it's on top.
      ctx.fillStyle = "#000000";
      ctx.fillText(item["label"] + " " + (100 * item["score"]).toFixed(2) + "%", x, y);
    });
  };

  render() {
    return (
      <div>
        <h1>Real-Time Object Detection</h1>
        <h3>MobileNetV2</h3>
        <video
          style={{height: '600px', width: "500px"}}
          className="size"
          autoPlay
          playsInline
          muted
          ref={this.videoRef}
          width="600"
          height="500"
          id="frame"
        />
        <canvas
          className="size"
          ref={this.canvasRef}
          width="600"
          height="500"
        />
      </div>
    );
  }
}

const rootElement = document.getElementById("root");
ReactDOM.render(<App />, rootElement);

Thanks

CodePudding user response:

It will depend on how you labeled your data, but usually it is an array containing, in each element:

x_center, y_center, height, width, class, score

or

x0, y0, xf, yf, class, score
(where x0,y0 is the upper-left corner and xf,yf the lower-right corner of the bounding box)

In either case, you will have to scale:

(x0, xf, x_center, etc.) * runtime_image_width
(y0, yf, y_center, etc.) * runtime_image_height

since the predicted positions are relative to the image dimensions, not absolute pixel values.
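As a sketch of that scaling step (assuming the TF Object Detection API's [ymin, xmin, ymax, xmax] normalized box order; `toPixelBox` is a hypothetical helper, not part of the template):

```javascript
// Convert one normalized [ymin, xmin, ymax, xmax] box into a pixel-space
// [x, y, width, height] rectangle suitable for ctx.strokeRect.
// frameWidth/frameHeight are the displayed video dimensions.
function toPixelBox([yMin, xMin, yMax, xMax], frameWidth, frameHeight) {
  const x = xMin * frameWidth;
  const y = yMin * frameHeight;
  const width = (xMax - xMin) * frameWidth;
  const height = (yMax - yMin) * frameHeight;
  return [x, y, width, height];
}
```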

Long story short: you will have to either test different combinations or assume the output follows exactly the format you labeled with. But the example you gave is pretty straightforward on:

const boxes = predictions[4].arraySync();
const scores = predictions[5].arraySync();
const classes = predictions[6].dataSync(); 
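Rather than testing combinations blindly, you can log each output tensor's shape: a [1, 100, 4] tensor can only be the boxes, [1, 100] tensors are scores or classes, [1] is the detection count, and [1, 1917, ...] tensors are raw per-anchor outputs. A minimal sketch (`describeOutputs` is a hypothetical helper; on a tfjs GraphModel you can also inspect `model.outputNodes` for the output names):

```javascript
// Summarize each output tensor by its shape instead of dumping values.
// Shapes identify roles: [1, 100, 4] -> boxes, [1, 100] -> scores/classes,
// [1] -> num_detections, [1, 1917, ...] -> raw anchor outputs.
function describeOutputs(predictions) {
  return predictions.map((t, i) => `output ${i}: shape [${t.shape.join(", ")}]`);
}

// Usage inside renderPredictions:
//   console.log(describeOutputs(predictions).join("\n"));
//   console.log(model.outputNodes); // names such as "detection_boxes", if exposed
```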

CodePudding user response:

I figured it out by looking at the shapes and sizes of the tensors and at the detection boxes drawn on screen:

const boxes = predictions[2].arraySync();
const scores = predictions[7].arraySync();
const classes = predictions[1].dataSync();

But how can I figure it out only by reading these output tensors? Is there any way to make these tensors human-readable?
