I have a shader that I want to debug. It just runs some calculations and is done very fast. It doesn't do what I expected and I need to debug some variables. After some digging, I found out that you can somehow trigger the debugger programmatically as well as in the GUI.
I looked into the documentation on programmatic frame capture by Apple, but found it inconclusive.
Like this part:
func triggerProgrammaticCapture() {
    let captureManager = MTLCaptureManager.shared()
    let captureDescriptor = MTLCaptureDescriptor()
    captureDescriptor.captureObject = self.device
    do {
        try captureManager.startCapture(with: captureDescriptor)
    } catch {
        fatalError("error when trying to capture: \(error)")
    }
}

func runMetalCommands() {
    let commandBuffer = commandQueue.makeCommandBuffer()!
    // Do Metal work
    commandBuffer.commit()
    let captureManager = MTLCaptureManager.shared()
    captureManager.stopCapture()
}
I suspect runMetalCommands() is the function that handles the shader. But for some reason they never call triggerProgrammaticCapture(), yet the capture starts? How is this supposed to work?
Also, is // Do Metal work supposed to be the part where I define my device, or does that come afterwards? I don't understand what they are doing there and would love to understand how to capture a GPU frame so I can debug the variables in my shader.
Some background - this is what I'm working on: it calculates Dijkstra in parallel, one start node per thread.
The result isn't correct though, and I suspect the variable next_node to be off somehow. To look into it, I need to capture a GPU frame.
import MetalKit
let gra: [[Int32]] = [ [0, 6, -1, 1, -1],
[6, 0, 5, 2, 2],
[-1, 5, 0, -1, 5],
[1, 2, -1, 0, 1],
[-1, 2, 5, 1, 0]
]
func flatten(arr: [[Int32]]) -> [Int32] {
    var result: [Int32] = []
    for i in arr {
        for j in i {
            result.append(j)
        }
    }
    return result
}
func gpuProcess(graph: [[Int32]]) {
    let len = graph.count           // value of 5
    let size = len * graph[0].count // value of 25
    print(size)

    // GPU we want to use
    let device = MTLCreateSystemDefaultDevice()

    // Fifo queue for sending commands to the gpu
    let commandQueue = device?.makeCommandQueue()

    // The library for getting our metal functions
    let gpuFunctionLibrary = device?.makeDefaultLibrary()

    // Grab gpu function
    let dijkstraFunction = gpuFunctionLibrary?.makeFunction(name: "dijkstra")

    var dijkstraPipelineState: MTLComputePipelineState!
    do {
        dijkstraPipelineState = try device?.makeComputePipelineState(function: dijkstraFunction!)
    } catch {
        print(error)
    }

    let array_flattend = flatten(arr: graph)

    // Create buffers to be sent to the gpu from our array
    let graphBuff = device?.makeBuffer(bytes: array_flattend,
                                       length: MemoryLayout<Int32>.stride * size,
                                       options: .storageModeShared)
    let previousNodesBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
                                               options: .storageModeShared)
    let visitedNodesBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
                                              options: .storageModeShared)
    let resultBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
                                        options: .storageModeShared)

    // Create the buffer to be sent to the command queue
    let commandBuffer = commandQueue?.makeCommandBuffer()

    // Create an encoder to set values on the compute function
    let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
    commandEncoder?.setComputePipelineState(dijkstraPipelineState)

    // Set the parameters of our gpu function
    commandEncoder?.setBuffer(graphBuff, offset: 0, index: 0)
    commandEncoder?.setBuffer(previousNodesBuff, offset: 0, index: 1)
    commandEncoder?.setBuffer(visitedNodesBuff, offset: 0, index: 2)
    commandEncoder?.setBuffer(resultBuff, offset: 0, index: 3)

    // Set parameters for our iterator
    var numNodes = len
    commandEncoder?.setBytes(&numNodes, length: MemoryLayout.size(ofValue: numNodes), index: 4)

    // Figure out how many threads we need to use for our operation
    let threadsPerGrid = MTLSize(width: size, height: 1, depth: 1)
    let maxThreadsPerThreadgroup = dijkstraPipelineState.maxTotalThreadsPerThreadgroup // 1024
    let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
    commandEncoder?.dispatchThreads(threadsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)

    // Tell encoder that it is done encoding. Now we can send this off to the gpu.
    commandEncoder?.endEncoding()

    // Push this command to the command queue for processing
    commandBuffer?.commit()

    // Wait until the gpu function completes before working with any of the data
    commandBuffer?.waitUntilCompleted()

    // Get the pointer to the beginning of our data
    var resultBufferPointer = resultBuff?.contents().bindMemory(to: Int32.self,
                                                                capacity: MemoryLayout<Int32>.stride * size)
    var resultArray = [Int32]()
    for _ in 0..<size {
        resultArray.append(Int32(resultBufferPointer!.pointee))
        resultBufferPointer = resultBufferPointer?.advanced(by: 1)
    }

    print()
    for i in 0..<len {
        print(graph[i])
    }

    print()
    for i in 0..<len {
        let start = i * len
        let stop = start + len
        print(resultArray[start..<stop])
    }
}
// Call function
gpuProcess(graph: gra)
#include <metal_stdlib>
using namespace metal;

kernel void dijkstra(constant int  *graph              [[ buffer(0) ]],
                     device   int  *previousNodesArray [[ buffer(1) ]],
                     device   int  *visitedNodesArray  [[ buffer(2) ]],
                     device   int  *resultArray        [[ buffer(3) ]],
                     constant uint &size               [[ buffer(4) ]],
                     uint index [[ thread_position_in_grid ]]) {
    // Early out if gid is out of array bounds
    if (index >= size) {
        return;
    }

    // Get thread index
    uint start_node = index;
    // Start index in all arrays depending on the start node
    uint local_start = start_node * size;

    // First step
    for (uint i = 0; i < size; i++) {
        if (graph[local_start + i] != -1) {
            resultArray[local_start + i] = graph[local_start + i];
            previousNodesArray[local_start + i] = start_node;
        }
    }
    visitedNodesArray[local_start + start_node] = 1;

    // Process rest of graph
    for (uint _ = 0; _ < size; _++) {
        int next_node = -1;
        // Determine the next node
        for (uint i = 0; i < size; i++) {
            // Already visited
            if (visitedNodesArray[local_start + i] == 1)
                continue;
            // Next is still at starting value
            if (next_node == -1)
                next_node = i;
            else if (resultArray[local_start + i] <
                     resultArray[local_start + next_node])
                next_node = i;
        }
        // No unvisited node found
        if (next_node == -1)
            return;

        // Look for shorter paths
        int current_node = next_node;
        for (uint target = 0; target < size; target++) {
            // Already visited
            if (visitedNodesArray[local_start + target] == 1)
                continue;
            // No path to this node
            if (graph[current_node * size + target] == -1)
                continue;
            // Calculate path length from start node to unvisited neighbor node
            int path_length = resultArray[local_start + current_node]
                              + graph[current_node * size + target];
            // Shorter path found
            if (path_length < resultArray[local_start + target]) {
                resultArray[local_start + target] = path_length;
                previousNodesArray[local_start + target] = current_node;
            }
        }
        // Set current node as visited
        visitedNodesArray[local_start + current_node] = 1;
    }
}
Edit
For anyone finding this, the reason it didn't work is that my resultBuff needs to be initialized with large numbers instead of being left empty. Add this to your code for Dijkstra to give you the correct results.
let gra_max: [[Int32]] = [ [999, 999, 999, 999, 999],
[999, 999, 999, 999, 999],
[999, 999, 999, 999, 999],
[999, 999, 999, 999, 999],
[999, 999, 999, 999, 999]
]
// Flatten to a contiguous [Int32] before copying into the buffer
let resultBuff = device?.makeBuffer(bytes: flatten(arr: gra_max),
                                    length: MemoryLayout<Int32>.stride * size,
                                    options: .storageModeShared)
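A simpler way to get the same initialization (just a sketch of the same idea, not required) is to fill a flat array with a large sentinel value instead of hard-coding a matrix:

// Sketch: same effect as gra_max above. Every distance starts at a large
// "infinity" value before the kernel relaxes it.
let initialDistances = [Int32](repeating: 999, count: size)
let resultBuff = device?.makeBuffer(bytes: initialDistances,
                                    length: MemoryLayout<Int32>.stride * size,
                                    options: .storageModeShared)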
CodePudding user response:
The first thing you need to know is that GPU capture works in terms of command buffers: a command buffer is the smallest "thing" it can capture.
To start a capture programmatically, you need to create a capture descriptor.
MTLCaptureDescriptor lets you scope your capture to a device, a command queue, or a capture scope.
Scopes define a range of commands that you want to capture. This is useful for repeated work: for example, if a lot happens in your engine every frame, you can split it into scopes so you can capture and inspect only the relevant parts. Scopes also show up in Xcode when you press the Metal button.
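For example, a capture scope tied to a command queue could look roughly like this (a sketch; the label and the placement of begin()/end() are just illustrative):

let captureManager = MTLCaptureManager.shared()
// A scope tied to one command queue; only work between begin() and end() is captured
let dijkstraScope = captureManager.makeCaptureScope(commandQueue: commandQueue)
dijkstraScope.label = "Dijkstra dispatch"   // this name shows up in Xcode

let scopeDescriptor = MTLCaptureDescriptor()
scopeDescriptor.captureObject = dijkstraScope
try? captureManager.startCapture(with: scopeDescriptor)

dijkstraScope.begin()
// encode and commit the command buffer(s) you want to inspect here
dijkstraScope.end()

captureManager.stopCapture()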
Capturing a device or a queue lets you capture all command buffers created between the startCapture and stopCapture calls: every command buffer from every queue on the device, or only those from that single queue, respectively.
You can also specify a destination on your capture descriptor: either Xcode (.developerTools, the default) or a .gputrace document (.gpuTraceDocument).
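If you want a trace file on disk instead of opening the capture in Xcode, that could look something like this (a sketch; the output path is just an example, and programmatic capture may need to be enabled for the process, e.g. via the METAL_CAPTURE_ENABLED environment variable when not launched from Xcode):

let captureManager = MTLCaptureManager.shared()
let traceDescriptor = MTLCaptureDescriptor()
traceDescriptor.captureObject = device

// Write a .gputrace document you can open in Xcode later
if captureManager.supportsDestination(.gpuTraceDocument) {
    traceDescriptor.destination = .gpuTraceDocument
    traceDescriptor.outputURL = URL(fileURLWithPath: "/tmp/dijkstra.gputrace") // example path
}

try? captureManager.startCapture(with: traceDescriptor)
// ... do the GPU work you want captured ...
captureManager.stopCapture()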
For your case, it would look something like this:
import MetalKit
// ...
func gpuProcess(graph: [[Int32]]) {
    // ...

    // GPU we want to use, use MTLCopyAllDevices in CLI utilities
    let device = MTLCopyAllDevices().first!

    let captureDescriptor = MTLCaptureDescriptor()
    captureDescriptor.captureObject = device
    // destination is developerTools by default
    try? MTLCaptureManager.shared().startCapture(with: captureDescriptor)

    // Fifo queue for sending commands to the gpu
    let commandQueue = device.makeCommandQueue()

    // Rest of the code ...

    if MTLCaptureManager.shared().isCapturing {
        MTLCaptureManager.shared().stopCapture()
    }

    // Print your results, no GPU work here so no need to capture
    // ...
}
// Call function
gpuProcess(graph: gra)
Side note: don't use MTLCreateSystemDefaultDevice() in command line utilities. It only works because you are probably running it from Xcode, where View debugging is enabled by default in the scheme. Actual command line applications can't use MTLCreateSystemDefaultDevice and should rely on MTLCopyAllDevices.
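For completeness, a sketch of picking a device in a command line tool without MTLCreateSystemDefaultDevice():

import Metal

// Sketch: enumerate all Metal devices and take the first one;
// on a multi-GPU Mac you could filter by isLowPower / isRemovable instead.
guard let device = MTLCopyAllDevices().first else {
    fatalError("No Metal devices available")
}
print("Using device: \(device.name)")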