I have a shader that I want to debug. It just runs some calculations and is done very fast. It doesn't do what I expected and I need to debug some variables. After some digging, I found out that you can somehow trigger the debugger programmatically as well as in the GUI.
I looked into the documentation on programmatic frame capture by Apple, but found it inconclusive.
Like this part:
func triggerProgrammaticCapture() {
    let captureManager = MTLCaptureManager.shared()
    let captureDescriptor = MTLCaptureDescriptor()
    captureDescriptor.captureObject = self.device
    do {
        try captureManager.startCapture(with: captureDescriptor)
    } catch {
        fatalError("error when trying to capture: \(error)")
    }
}

func runMetalCommands() {
    let commandBuffer = commandQueue.makeCommandBuffer()!
    // Do Metal work
    commandBuffer.commit()
    let captureManager = MTLCaptureManager.shared()
    captureManager.stopCapture()
}
I suspect runMetalCommands() is the function that handles the shader. But for some reason they never call triggerProgrammaticCapture(), yet the capture starts? How is this supposed to work?
Also, is // Do Metal work supposed to be the part where I define my device, or does that come afterwards? I don't understand what they are doing there and would love to understand how to capture a GPU frame so I can debug the variables in my shader.
Some background - this is what I'm working on: it calculates Dijkstra in parallel, one start node per thread.
The result isn't correct though, and I suspect the variable next_node to be off somehow. To look into it, I need to capture a GPU frame.
import MetalKit
let gra: [[Int32]] = [ [0, 6, -1, 1, -1],
[6, 0, 5, 2, 2],
[-1, 5, 0, -1, 5],
[1, 2, -1, 0, 1],
[-1, 2, 5, 1, 0]
]
func flatten(arr: [[Int32]]) -> [Int32] {
    var result: [Int32] = []
    for i in arr {
        for j in i {
            result.append(j)
        }
    }
    return result
}
func gpuProcess(graph: [[Int32]]) {
    let len = graph.count           // value of 5
    let size = len * graph[0].count // value of 25
    print(size)

    // GPU we want to use
    let device = MTLCreateSystemDefaultDevice()

    // Fifo queue for sending commands to the gpu
    let commandQueue = device?.makeCommandQueue()

    // The library for getting our metal functions
    let gpuFunctionLibrary = device?.makeDefaultLibrary()

    // Grab gpu function
    let dijkstraFunction = gpuFunctionLibrary?.makeFunction(name: "dijkstra")

    var dijkstraPipelineState: MTLComputePipelineState!
    do {
        dijkstraPipelineState = try device?.makeComputePipelineState(function: dijkstraFunction!)
    } catch {
        print(error)
    }

    let array_flattend = flatten(arr: graph)

    // Create buffers to be sent to the gpu from our array
    let graphBuff = device?.makeBuffer(bytes: array_flattend,
                                       length: MemoryLayout<Int32>.stride * size,
                                       options: .storageModeShared)
    let previousNodesBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
                                               options: .storageModeShared)
    let visitedNodesBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
                                              options: .storageModeShared)
    let resultBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
                                        options: .storageModeShared)

    // Create the buffer to be sent to the command queue
    let commandBuffer = commandQueue?.makeCommandBuffer()

    // Create an encoder to set values on the compute function
    let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
    commandEncoder?.setComputePipelineState(dijkstraPipelineState)

    // Set the parameters of our gpu function
    commandEncoder?.setBuffer(graphBuff, offset: 0, index: 0)
    commandEncoder?.setBuffer(previousNodesBuff, offset: 0, index: 1)
    commandEncoder?.setBuffer(visitedNodesBuff, offset: 0, index: 2)
    commandEncoder?.setBuffer(resultBuff, offset: 0, index: 3)

    // Set parameters for our iterator
    var numNodes = len
    commandEncoder?.setBytes(&numNodes, length: MemoryLayout.size(ofValue: numNodes), index: 4)

    // Figure out how many threads we need to use for our operation
    let threadsPerGrid = MTLSize(width: size, height: 1, depth: 1)
    let maxThreadsPerThreadgroup = dijkstraPipelineState.maxTotalThreadsPerThreadgroup // 1024
    let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
    commandEncoder?.dispatchThreads(threadsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)

    // Tell encoder that it is done encoding. Now we can send this off to the gpu.
    commandEncoder?.endEncoding()

    // Push this command to the command queue for processing
    commandBuffer?.commit()

    // Wait until the gpu function completes before working with any of the data
    commandBuffer?.waitUntilCompleted()

    // Get the pointer to the beginning of our data
    var resultBufferPointer = resultBuff?.contents().bindMemory(to: Int32.self,
                                                                capacity: MemoryLayout<Int32>.stride * size)
    var resultArray = [Int32]()
    for _ in 0..<size {
        resultArray.append(Int32(resultBufferPointer!.pointee))
        resultBufferPointer = resultBufferPointer?.advanced(by: 1)
    }

    print()
    for i in 0..<len {
        print(graph[i])
    }

    print()
    for i in 0..<len {
        let start = i * len
        let stop = start + len
        print(resultArray[start..<stop])
    }
}
// Call function
gpuProcess(graph: gra)
#include <metal_stdlib>
using namespace metal;

kernel void dijkstra(constant int  *graph              [[ buffer(0) ]],
                     device   int  *previousNodesArray [[ buffer(1) ]],
                     device   int  *visitedNodesArray  [[ buffer(2) ]],
                     device   int  *resultArray        [[ buffer(3) ]],
                     constant uint &size               [[ buffer(4) ]],
                     uint index [[ thread_position_in_grid ]]) {
    // Early out if gid is out of array bounds
    if (index >= size) {
        return;
    }

    // Get thread index
    uint start_node = index;
    // Start index in all arrays depending on the start node
    uint local_start = start_node * size;

    // First step
    for (uint i = 0; i < size; i++) {
        if (graph[local_start + i] != -1) {
            resultArray[local_start + i] = graph[local_start + i];
            previousNodesArray[local_start + i] = start_node;
        }
    }
    visitedNodesArray[local_start + start_node] = 1;

    // Process rest of graph
    for (uint _ = 0; _ < size; _++) {
        int next_node = -1;
        // Determine the next node
        for (uint i = 0; i < size; i++) {
            // Already visited
            if (visitedNodesArray[local_start + i] == 1)
                continue;
            // Next is still at starting value
            if (next_node == -1)
                next_node = i;
            else if (resultArray[local_start + i] <
                     resultArray[local_start + next_node])
                next_node = i;
        }
        // No unvisited node found
        if (next_node == -1)
            return;

        // Look for shorter paths
        int current_node = next_node;
        for (uint target = 0; target < size; target++) {
            // Already visited
            if (visitedNodesArray[local_start + target] == 1)
                continue;
            // No path to this node
            if (graph[current_node * size + target] == -1)
                continue;
            // Calculate path length from start node to unvisited neighbor node
            int path_length = resultArray[local_start + current_node]
                              + graph[current_node * size + target];
            // Shorter path found
            if (path_length < resultArray[local_start + target]) {
                resultArray[local_start + target] = path_length;
                previousNodesArray[local_start + target] = current_node;
            }
        }
        // Set current node as visited
        visitedNodesArray[local_start + current_node] = 1;
    }
}
Edit
For anyone finding this, the reason it didn't work is that my resultBuff needs to be initialized with large numbers instead of being left empty. Add this to your code for Dijkstra to give you the correct results.
let gra_max: [[Int32]] = [ [999, 999, 999, 999, 999],
[999, 999, 999, 999, 999],
[999, 999, 999, 999, 999],
[999, 999, 999, 999, 999],
[999, 999, 999, 999, 999]
]
// Flatten to a contiguous [Int32] before copying into the buffer
let resultBuff = device?.makeBuffer(bytes: flatten(arr: gra_max),
                                    length: MemoryLayout<Int32>.stride * size,
                                    options: .storageModeShared)
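A simpler way to get the same initialization (just a sketch of the same idea, not required) is to fill a flat array with a large sentinel value instead of hard-coding a matrix:

// Sketch: same effect as gra_max above. Every distance starts at a large
// "infinity" value before the kernel relaxes it.
let initialDistances = [Int32](repeating: 999, count: size)
let resultBuff = device?.makeBuffer(bytes: initialDistances,
                                    length: MemoryLayout<Int32>.stride * size,
                                    options: .storageModeShared)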
CodePudding user response:
The first thing you need to know is that GPU capture works in terms of command buffers: a command buffer is the smallest "thing" it can capture.
To start a capture programmatically, you need to create a capture descriptor.
MTLCaptureDescriptor lets you scope your capture to a device, a command queue, or a capture scope.
Scopes define a range of commands that you want to capture. This is useful for repeated work: for example, if a lot happens in your engine every frame, you can split it into scopes so you can capture and inspect only the relevant parts. Scopes also show up in Xcode when you press the Metal button.
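For example, a capture scope tied to a command queue could look roughly like this (a sketch; the label and the placement of begin()/end() are just illustrative):

let captureManager = MTLCaptureManager.shared()
// A scope tied to one command queue; only work between begin() and end() is captured
let dijkstraScope = captureManager.makeCaptureScope(commandQueue: commandQueue)
dijkstraScope.label = "Dijkstra dispatch"   // this name shows up in Xcode

let scopeDescriptor = MTLCaptureDescriptor()
scopeDescriptor.captureObject = dijkstraScope
try? captureManager.startCapture(with: scopeDescriptor)

dijkstraScope.begin()
// encode and commit the command buffer(s) you want to inspect here
dijkstraScope.end()

captureManager.stopCapture()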
Capturing a device or a queue lets you capture all command buffers created between the startCapture and stopCapture calls: every command buffer from every queue on the device, or only those from that single queue, respectively.
You can also specify a destination on your capture descriptor: either Xcode (.developerTools, the default) or a .gputrace document (.gpuTraceDocument).
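If you want a trace file on disk instead of opening the capture in Xcode, that could look something like this (a sketch; the output path is just an example, and programmatic capture may need to be enabled for the process, e.g. via the METAL_CAPTURE_ENABLED environment variable when not launched from Xcode):

let captureManager = MTLCaptureManager.shared()
let traceDescriptor = MTLCaptureDescriptor()
traceDescriptor.captureObject = device

// Write a .gputrace document you can open in Xcode later
if captureManager.supportsDestination(.gpuTraceDocument) {
    traceDescriptor.destination = .gpuTraceDocument
    traceDescriptor.outputURL = URL(fileURLWithPath: "/tmp/dijkstra.gputrace") // example path
}

try? captureManager.startCapture(with: traceDescriptor)
// ... do the GPU work you want captured ...
captureManager.stopCapture()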
For your case, it would look something like this:
import MetalKit
// ...
func gpuProcess(graph: [[Int32]]) {
    // ...

    // GPU we want to use, use MTLCopyAllDevices in CLI utilities
    let device = MTLCopyAllDevices().first!

    let captureDescriptor = MTLCaptureDescriptor()
    captureDescriptor.captureObject = device
    // destination is developerTools by default
    try? MTLCaptureManager.shared().startCapture(with: captureDescriptor)

    // Fifo queue for sending commands to the gpu
    let commandQueue = device.makeCommandQueue()

    // Rest of the code ...

    if MTLCaptureManager.shared().isCapturing {
        MTLCaptureManager.shared().stopCapture()
    }

    // Print your results, no GPU work here so no need to capture
    // ...
}
// Call function
gpuProcess(graph: gra)
Side note: don't use MTLCreateSystemDefaultDevice() in command line utilities. It only works because you are probably running it from Xcode, where View debugging is enabled by default in the scheme. Actual command line applications can't use MTLCreateSystemDefaultDevice and should rely on MTLCopyAllDevices.
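For completeness, a sketch of picking a device in a command line tool without MTLCreateSystemDefaultDevice():

import Metal

// Sketch: enumerate all Metal devices and take the first one;
// on a multi-GPU Mac you could filter by isLowPower / isRemovable instead.
guard let device = MTLCopyAllDevices().first else {
    fatalError("No Metal devices available")
}
print("Using device: \(device.name)")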