{"id":1032,"hash":"946572afead4c07e5f00b4854d01b00d394dc3c92d97d9424e7d86816c2848ae","pattern":"Tokenizer.from_file() HuggingFace: Exception: data did not match any variant of untagged enum ModelWrapper","full_message":"I am having an issue loading a BPE tokenizer with Tokenizer.from_file().\nWhen I try, I encounter this error, where line 11743 is the last line of the file:\nException: data did not match any variant of untagged enum ModelWrapper at line 11743 column 3\nI have no idea what the problem is or how to solve it.\nDoes anyone have a clue?\nI did not train the BPE directly, but the structure is the correct one, i.e. vocab and merges in a JSON file. Starting from a BPE I had trained (which was working), I completely changed the vocab and the merges to something I created manually (without proper training). But I don't see the problem, since the structure should be the same as the original one.\nMy tokenizers version is 0.13.1.\n\n{\n  \"version\":\"1.0\",\n  \"truncation\":null,\n  \"padding\":null,\n  \"added_tokens\":[\n    {\n      \"id\":0,\n      \"content\":\"[UNK]\",\n      \"single_word\":false,\n      \"lstrip\":false,\n      \"rstrip\":false,\n      \"normalized\":false,\n      \"special\":true\n    },\n    {\n      \"id\":1,\n      \"content\":\"[CLS]\",\n      \"single_word\":false,\n      \"lstrip\":false,\n      \"rstrip\":false,\n      \"normalized\":false,\n      \"special\":true\n    },\n    {\n      \"id\":2,\n      \"content\":\"[SEP]\",\n      \"single_word\":false,\n      \"lstrip\":false,\n      \"rstrip\":false,\n      \"normalized\":false,\n      \"special\":true\n    },\n    {\n      \"id\":3,\n      \"content\":\"[PAD]\",\n      \"single_word\":false,\n      \"lstrip\":false,\n      \"rstrip\":false,\n      \"normalized\":false,\n      \"special\":true\n    },\n    {\n      \"id\":4,\n      \"content\":\"[MASK]\",\n      \"single_word\":false,\n      \"lstrip\":false,\n      \"rstrip\":false,\n      \"normalized\":false,\n      
\"special\":true\n    }\n  ],\n  \"normalizer\":null,\n  \"pre_tokenizer\":{\n    \"type\":\"Whitespace\"\n  },\n  \"post_processor\":null,\n  \"decoder\":null,\n  \"model\":{\n    \"type\":\"BPE\",\n    \"dropout\":null,\n    \"unk_token\":\"[UNK]\",\n    \"continuing_subword_prefix\":null,\n    \"end_of_word_suffix\":null,\n    \"fuse_unk\":false,\n    \"vocab\":{\n      \"[UNK]\":0,\n      \"[CLS]\":1,\n      \"[SEP]\":2,\n      \"[PAD]\":3,\n      \"[MASK]\":4,\n      \"AA\":5,\n      \"A\":6,\n      \"C\":7,\n      \"D\":8,\n.....\n\nmerges:\n\n....\n      \"QD FLPDSITF\",\n      \"QPHY AS\",\n      \"LR SE\",\n      \"A DRV\"\n    ] #11742\n  } #11743\n} #11744","ecosystem":"pypi","package_name":"json","package_version":null,"solution":"When I encountered this problem, the root cause was a missing pre_tokenizer; in my case, adding a Whitespace pre-tokenizer solved the issue.\n\nHere is an example (with the imports from the tokenizers package):\n\nfrom tokenizers import Tokenizer\nfrom tokenizers.models import BPE\nfrom tokenizers.pre_tokenizers import Whitespace\n\ntokenizer = Tokenizer(BPE())\ntokenizer.pre_tokenizer = Whitespace()","confidence":0.7000000000000001,"source":"stackoverflow","source_url":"https://stackoverflow.com/questions/74279005/tokenizer-from-file-hugginface-exception-data-did-not-match-any-variant-of","votes":12,"created_at":"2026-04-19T04:52:12.295015+00:00","updated_at":"2026-04-19T04:52:12.295015+00:00"}