{"id":1057,"hash":"406f2b99663e61c3b5827a7f59c4ea453ac091fc9e4eabf003fa5fb77e79c4a7","pattern":"TfidfVectorizer in scikit-learn : ValueError: np.nan is an invalid document","full_message":"I'm using TfidfVectorizer from scikit-learn to do some feature extraction from text data. I have a CSV file with a Score (can be +1 or -1) and a Review (text). I pulled this data into a DataFrame so I can run the Vectorizer.\n\nThis is my code: \n\nimport pandas as pd\nimport numpy as np\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\ndf = pd.read_csv(\"train_new.csv\",\n             names = ['Score', 'Review'], sep=',')\n\n# x = df['Review'] == np.nan\n#\n# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)\n#\n# print df.isnull().values.any()\n\nv = TfidfVectorizer(decode_error='replace', encoding='utf-8')\nx = v.fit_transform(df['Review'])\n\nThis is the traceback for the error I get: \n\nTraceback (most recent call last):\n  File \"/home/PycharmProjects/Review/src/feature_extraction.py\", line 16, in <module>\nx = v.fit_transform(df['Review'])\n File \"/home/b/hw1/local/lib/python2.7/site-   packages/sklearn/feature_extraction/text.py\", line 1305, in fit_transform\n   X = super(TfidfVectorizer, self).fit_transform(raw_documents)\n File \"/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py\", line 817, in fit_transform\nself.fixed_vocabulary_)\n File \"/home/b/work/local/lib/python2.7/site- packages/sklearn/feature_extraction/text.py\", line 752, in _count_vocab\n   for feature in analyze(doc):\n File \"/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py\", line 238, in <lambda>\ntokenize(preprocess(self.decode(doc))), stop_words)\n File \"/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py\", line 118, in decode\n raise ValueError(\"np.nan is an invalid document, expected byte or \"\n ValueError: np.nan is an invalid document, expected byte or 
unicode string.\n\nI checked the CSV file and DataFrame for anything that's being read as NaN but I can't find anything. There are 18000 rows, none of which return isnan as True. \n\nThis is what df['Review'].head() looks like: \n\n  0    This book is such a life saver.  It has been s...\n  1    I bought this a few times for my older son and...\n  2    This is great for basics, but I wish the space...\n  3    This book is perfect!  I'm a first time new mo...\n  4    During your postpartum stay at the hospital th...\n  Name: Review, dtype: object","ecosystem":"pypi","package_name":"pandas","package_version":null,"solution":"The error means at least one entry in the Review column is np.nan rather than a string. The column has dtype object, so convert it to unicode strings before vectorizing, as the traceback asks:\n\nx = v.fit_transform(df['Review'].values.astype('U'))  ## Even astype(str) would work\n\nNote that this conversion turns a missing value into the literal string 'nan'. The commented-out check df['Review'] == np.nan cannot find such rows, because NaN never compares equal to anything, including itself; df['Review'].isnull() does find them. If you would rather blank or drop those rows, call df['Review'].fillna('') or df['Review'].dropna() first.\n\nFrom the TfidfVectorizer documentation:\n\nfit_transform(raw_documents, y=None) \n\nParameters:     raw_documents : iterable \n\nan iterable which yields either str, unicode or file objects","confidence":0.95,"source":"stackoverflow","source_url":"https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document","votes":69,"created_at":"2026-04-19T04:52:13.808461+00:00","updated_at":"2026-04-19T04:52:13.808461+00:00"}