ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error
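import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
    # Load the CSV and hold out 10% of the rows as the test set
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)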
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
When I tokenize the texts split from the DataFrame with the DistilBERT tokenizer, I get the error in the title.

I had the same error. The problem was that I had None in my list, e.g.:

import pandas as pd
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')
# create test dataframe
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
         'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
         'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
         None]
labels = [1, 2, 3, 1]
d = {'texts': texts, 'labels': labels}
test_df = pd.DataFrame(d)

So, before converting the DataFrame columns to lists, I removed all rows containing None:

test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)

This worked for me.
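If the data is already in plain Python lists (as in the question, after `to_list()`), the same cleanup can be sketched without pandas. This is a minimal sketch, not part of the original answer: `drop_non_string_rows` is a hypothetical helper name, and it assumes the offending entries are `None` or `NaN` floats, which is what `pd.read_csv` typically produces for missing cells. It removes every entry that is not a `str` and keeps the labels aligned with the surviving texts.

```python
from typing import Any, List, Sequence, Tuple


def drop_non_string_rows(
    texts: Sequence[Any], labels: Sequence[Any]
) -> Tuple[List[str], List[Any]]:
    """Hypothetical helper: keep only (text, label) pairs whose text is a str."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    # None and NaN (a float) both fail the isinstance check and are dropped
    kept = [(t, l) for t, l in zip(texts, labels) if isinstance(t, str)]
    clean_texts = [t for t, _ in kept]
    clean_labels = [l for _, l in kept]
    return clean_texts, clean_labels


# Illustrative data, not taken from the original post
texts = ["good movie", None, float("nan"), "bad movie"]
labels = [1, 0, 1, 0]
clean_texts, clean_labels = drop_non_string_rows(texts, labels)
print(clean_texts)   # ['good movie', 'bad movie']
print(clean_labels)  # [1, 0]
```

The cleaned lists can then be passed to the tokenizer as before, e.g. `tokenizer(clean_texts, truncation=True, padding=True)`. Filtering texts and labels together matters here: dropping entries from the text list alone would leave the labels misaligned.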