pypi · scikit-learn · 95% confidence

RandomForestClassifier.fit(): ValueError: could not convert string to float

Full error message

Given a simple CSV file, test.csv:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

scikit-learn version is 0.16.1.

Solution (source: Stack Overflow)

You have to encode the string columns before calling fit(). As already noted, fit() does not accept strings, but you can work around this. Several classes can be used:

LabelEncoder: turns each distinct string into an incremental integer value
OneHotEncoder: uses the one-of-K scheme to turn each distinct string into its own binary (0/1) column

Personally, I posted almost the same question on Stack Overflow some time ago. I wanted a scalable solution, but didn't get any answer. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have a lot of distinct strings the matrix grows very quickly and needs a lot of memory.
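The OneHotEncoder approach could look roughly like the sketch below. It is not the original answerer's code, and it rests on a few assumptions: scikit-learn 0.20 or newer, where OneHotEncoder accepts string columns directly (in 0.16.1, the version from the question, it only took integer input, so the strings would first have to be mapped to integers, for example with LabelEncoder); the CSV is inlined instead of read from test.csv; and the target column C is left out of the features, since the question's train_x includes it. The names feature_cols and encoder are made up for illustration.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Same toy data as test.csv in the question, inlined so the sketch is self-contained.
csv_text = "A,B,C\nHello,Hi,0\nHola,Bueno,1\n"
test = pd.read_csv(io.StringIO(csv_text), dtype={"A": str, "B": str, "C": int})

feature_cols = ["A", "B"]  # assumption: string features only, target C excluded
train_y = test["C"] == 1

# One-of-K encoding: one binary column per distinct string in each feature.
# Assumes scikit-learn >= 0.20, where string input is accepted.
encoder = OneHotEncoder(handle_unknown="ignore")
train_x = encoder.fit_transform(test[feature_cols])  # sparse matrix by default

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)

# The encoded width is the total number of distinct strings across the features.
n_rows, n_encoded_cols = train_x.shape
print(n_rows, n_encoded_cols)
```

The encoded matrix gets one column per distinct string, which is why it grows quickly on columns with many different values. The default sparse output helps with memory, and new data has to go through encoder.transform() with the same fitted encoder before calling predict().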

API access

Get this solution programmatically: free, no authentication.

curl https://depscope.dev/api/error/07b17be6ee8f3e0a0a02baf715de7e04641a9bd0ccc1694097d6f09f6378387a

hash · 07b17be6ee8f3e0a0a02baf715de7e04641a9bd0ccc1694097d6f09f6378387a