How to use all GPUs in SageMaker real-time inference?-CodePudding

I have deployed a model on real-time inference in a single gpu instance, it works fine.

Now I want to use a multiple GPUs to decrease the inference time, what do I need to change in my inference.py to make it work?

Here is some of my code:

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
    logger.info("Loading first model...")
    model = Model().to(DEVICE)
    with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
    model = model.eval()
    
    logger.info("Loading second model...")
    model_2 = Model_2()
    model_2.to(DEVICE)
    checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
    model_2(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
    model_2 = model_2()
    
    logger.info('Done loading models')
    return {'first_model': model, 'second_model': model_2}

def input_fn(request_body, request_content_type):
    assert request_content_type=='application/json'
    url = json.loads(request_body)['url']
    save_name = json.loads(request_body)['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor':input_batch, 'w':w,'h':h,'image':img, 'save_name':save_name}

def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']
    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask==2, 255, mask)
    mask = np.where(mask==1, 128, mask)
    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:,:,::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)
    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()
    
    return {"final_ouput": final_image, 'image':input_object['image'], 'save_name': input_object['save_name']}

I was thinking that maybe with torch multiprocessing, any tips?

CodePudding user response：

You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").

You must call the function by passing at least these three parameters:

module (Module) – module to be parallelized (your model)

device_ids (list of python:int or torch.device) – CUDA devices.

For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None.

For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)

output_device (int or torch.device) – Device location of output for single-device CUDA modules.

For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)

for example:

from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)