pypideep-learning95% confidence\u2191 116

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training

Full error message

I saved a checkpoint while training on gpu. After reloading the checkpoint and continue training I get the following error:

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

My training code is as follows:

def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()

    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....

The saving process happened after finishing each epoch:

torch.save({'epoch': epoch,'batch':batch_count,'model_state_dict': model.state_dict(),'optimizer_state_dict':
                    optimizer.state_dict(),'loss': total_loss/len(train_loader),'train_set':args.train_set,'val_set':args.val_set,'args':args}, f'{args.weights_dir}/FastDepth_Final.pth')

I can't figure why I get this error.
args.gpu == True, and I'm passing the model, all data, and loss function to cuda, somehow there is still a tensor on cpu, could anyone figure out what's wrong?

Thanks.

Solutionsource: stackoverflow \u2197

There might be an issue with the device parameters are on: If you need to move a model to GPU via .cuda() , please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call. In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.

API access

Get this solution programmatically \u2014 free, no authentication.

curl https://depscope.dev/api/error/2ee0ce0868b6256af1fe2db87b4f9b94aafec253062b9e15850225300600a016

hash \u00b7 2ee0ce0868b6256af1fe2db87b4f9b94aafec253062b9e15850225300600a016