{"id":1030,"hash":"e92e268001f0ff95cfa95404ad0f71cb7de8be17839a5ea9fa674029d75b4eca","pattern":"HuggingFace: ValueError: expected sequence of length 165 at dim 1 (got 128)","full_message":"I am trying to fine-tune the BERT language model on my own data. I've gone through their docs, but their tasks seem to be not quite what I need, since my end goal is embedding text. Here's my code:\n\nfrom datasets import load_dataset\nfrom transformers import BertTokenizerFast, AutoModel, TrainingArguments, Trainer\nimport glob\nimport os\n\nbase_path = '../data/'\nmodel_name = 'bert-base-uncased'\nmax_length = 512\ncheckpoints_dir = 'checkpoints'\n\ntokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)\n\ndef tokenize_function(examples):\n    return tokenizer(examples['text'], padding=True, truncation=True, max_length=max_length)\n\ndataset = load_dataset('text',\n        data_files={\n            'train': f'{base_path}train.txt',\n            'test': f'{base_path}test.txt',\n            'validation': f'{base_path}valid.txt'\n        }\n)\n\nprint('Tokenizing data. This may take a while...')\ntokenized_dataset = dataset.map(tokenize_function, batched=True)\ntrain_dataset = tokenized_dataset['train']\neval_dataset = tokenized_dataset['test']\n\nmodel = AutoModel.from_pretrained(model_name)\n\ntraining_args = TrainingArguments(checkpoints_dir)\n\nprint('Training the model...')\ntrainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)\ntrainer.train()\n\nI get the following error:\n\n  File \"train_lm_hf.py\", line 44, in <module>\n    trainer.train()\n...\n  File \"/opt/conda/lib/python3.7/site-packages/transformers/data/data_collator.py\", line 130, in torch_default_data_collator\n    batch[k] = torch.tensor([f[k] for f in features])\nValueError: expected sequence of length 165 at dim 1 (got 128)\n\nWhat am I doing wrong?","ecosystem":"pypi","package_name":"deep-learning","package_version":null,"solution":"I fixed this solution by changing the tokenize function to:\n\ndef tokenize_function(examples):\n    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)\n\n(note the padding argument). Also, I used a data collator like so:\n\ndata_collator = DataCollatorForLanguageModeling(\n    tokenizer=tokenizer, mlm=True, mlm_probability=0.15\n)\ntrainer = Trainer(\n        model=model,\n        args=training_args,\n        data_collator=data_collator,\n        train_dataset=train_dataset,\n        eval_dataset=eval_dataset\n)","confidence":0.95,"source":"stackoverflow","source_url":"https://stackoverflow.com/questions/71166789/huggingface-valueerror-expected-sequence-of-length-165-at-dim-1-got-128","votes":14,"created_at":"2026-04-19T04:52:12.293911+00:00","updated_at":"2026-04-19T04:52:12.293911+00:00"}