
HuggingFace: ValueError: expected sequence of length 165 at dim 1 (got 128)

I am trying to fine-tune the BERT language model on my own data. I've gone through their docs, but the tasks they cover don't quite match what I need, since my end goal is embedding text. Here's my code:

from datasets import load_dataset
from transformers import BertTokenizerFast, AutoModel, TrainingArguments, Trainer

base_path = '../data/'
model_name = 'bert-base-uncased'
max_length = 512
checkpoints_dir = 'checkpoints'

tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

def tokenize_function(examples):
    return tokenizer(examples['text'], padding=True, truncation=True, max_length=max_length)

dataset = load_dataset('text',
        data_files={
            'train': f'{base_path}train.txt',
            'test': f'{base_path}test.txt',
            'validation': f'{base_path}valid.txt'
        }
)

print('Tokenizing data. This may take a while...')
tokenized_dataset = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_dataset['train']
eval_dataset = tokenized_dataset['test']

model = AutoModel.from_pretrained(model_name)

training_args = TrainingArguments(checkpoints_dir)

print('Training the model...')
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

I get the following error:

  File "train_lm_hf.py", line 44, in <module>
    trainer.train()
...
  File "/opt/conda/lib/python3.7/site-packages/transformers/data/data_collator.py", line 130, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 165 at dim 1 (got 128)

What am I doing wrong?

I fixed this by changing the tokenize function to (note the padding argument):

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)

With padding=True, dataset.map pads each batch of examples only to the longest sequence in that batch, so examples from different batches end up with different lengths (here 165 vs. 128) and the default data collator cannot stack them into a single tensor. padding='max_length' pads every example to the same fixed length instead. Also, I used a data collator like so:

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
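For reference, here is a minimal end-to-end sketch of the corrected script. Two details are assumptions not spelled out in the answer above: DataCollatorForLanguageModeling has to be imported from transformers, and because the MLM collator produces a labels tensor, the model needs a language-modeling head, so the sketch loads AutoModelForMaskedLM rather than the plain AutoModel used in the question.

from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,  # assumption: an LM head is needed so the Trainer gets a loss
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

base_path = '../data/'
model_name = 'bert-base-uncased'
max_length = 512
checkpoints_dir = 'checkpoints'

tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

def tokenize_function(examples):
    # padding='max_length' gives every example the same length, so the
    # collator can stack examples from any map batch into one tensor
    return tokenizer(examples['text'], padding='max_length',
                     truncation=True, max_length=max_length)

dataset = load_dataset('text', data_files={
    'train': f'{base_path}train.txt',
    'test': f'{base_path}test.txt',
    'validation': f'{base_path}valid.txt',
})

tokenized_dataset = dataset.map(tokenize_function, batched=True)

model = AutoModelForMaskedLM.from_pretrained(model_name)

# The MLM collator randomly masks 15% of input tokens and builds the labels
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(checkpoints_dir),
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
)
trainer.train()

Note that Trainer's default remove_unused_columns=True drops the raw 'text' column before batching, which is why the collator only ever sees the tokenized fields.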
