RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training
Full error message
I saved a checkpoint while training on a GPU. After reloading the checkpoint and continuing training, I get the following error:
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
My training code is as follows:
def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()
    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....
The saving process happened after finishing each epoch:
torch.save({
    'epoch': epoch,
    'batch': batch_count,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': total_loss/len(train_loader),
    'train_set': args.train_set,
    'val_set': args.val_set,
    'args': args,
}, f'{args.weights_dir}/FastDepth_Final.pth')
I can't figure out why I get this error.
args.gpu == True, and I'm moving the model, all the data, and the loss function to CUDA, yet somehow there is still a tensor on the CPU. Could anyone figure out what's wrong?
Thanks.
Solution
Source: Stack Overflow
The problem may be which device the parameters are on. If you need to move a model to the GPU via .cuda(), do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects from those before the call. In general, make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
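The failing line in the traceback is the SGD momentum update, buf.mul_(momentum).add_(d_p, ...), which suggests that the restored momentum buffer and the gradient ended up on different devices. Below is a minimal sketch of a resume order that avoids this, assuming the checkpoint keys from the question ('model_state_dict', 'optimizer_state_dict'). The tiny nn.Linear model, the build helper, the train_one_step helper, and the in-memory buffer are stand-ins for illustration only, not the asker's code.

```python
import io

import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def build(device):
    """Hypothetical helper: move the model first, then create the optimizer."""
    model = nn.Linear(4, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    return model, optimizer


def train_one_step(model, optimizer, device):
    """Hypothetical helper: one optimizer step on random data."""
    inputs = torch.randn(8, 4, device=device)
    loss = model(inputs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# First run: train one step so the optimizer holds momentum buffers, then save.
model, optimizer = build(device)
train_one_step(model, optimizer, device)

buffer = io.BytesIO()  # stands in for the checkpoint file on disk
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    buffer,
)
buffer.seek(0)

# Resume: the model is already on the target device when the optimizer is
# constructed and when its state is loaded, so the restored momentum buffers
# are placed on the same device as the parameters.
checkpoint = torch.load(buffer, map_location=device)
model, optimizer = build(device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

train_one_step(model, optimizer, device)  # no device mismatch expected here
```

In the question, the model is moved with model.cuda() inside train(), that is, after the optimizer was created. If the optimizer state was also restored before that point, its buffers would have been placed according to where the parameters lived at load time. If the order cannot be changed, an alternative is to move each tensor in optimizer.state to the target device by hand after loading.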