Thursday, November 26, 2009

Classifier Accuracy using Cross-Validation

So far, I did not know of a good way to estimate the performance of a classifier. My usual approach would be to run the classifier against a set of test documents, and use the results as the basis for a somewhat subjective estimate of how good (or bad) the classifier was.

However, thanks to Dave who commented on this post, inquiring about an easy way to integrate in 10-fold cross validation into the classifier, I now have another, more objective way to estimate the performance of a classifier.

Cross-Validation is the process of splitting up your training set into k parts (usually 10), then using k-1 of these parts for training, and the remining part for testing, and recording your results. This step is repeated k times, each time using a different part for testing and the remaining parts for training. If you further randomize your training fold selection at each step, then repeating this entire process (usually 10 times) will give you more inputs for your estimate.

Integrating Cross-Validation

Integrating in the Cross-Validation into the Lucene Vector Space Classifier involved creating another train method which takes a set of document ids (for the Lucene index). The existing train() method delegates to the new train(Set<Integer> docIds) method. Here is the code for the new train() method.

A new method crossValidate() has been added, which takes in the number of folds and the number of times it should be run. At each run, the testing set is generated randomly containing (1/folds) records. The remaining records are the training set. A confusion matrix is updated with the results - this is just a matrix of predicted vs actual categories.

Once the runs are completed, the accuracy of the classifier can be determined as the sum of the diagonal elements of the matrix (i.e. all the cases where the predicted category matched the actual one) divided by the sum of all its elements. Some other measures can also be computed from this matrix - see this page - the notes for the full course may also be worth checking out.

Rather than cherry-pick which code changed and which did not, I am going to show the entire code here. The method of interest is the crossValidate() method.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
// Source: src/main/java/net/sf/jtmt/classifiers/LuceneVectorSpaceModelClassifier.java
package net.sf.jtmt.classifiers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import net.sf.jtmt.clustering.ByValueComparator;
import net.sf.jtmt.indexers.IdfIndexer;
import net.sf.jtmt.indexers.TfIndexer;
import net.sf.jtmt.similarity.AbstractSimilarity;
import net.sf.jtmt.similarity.CosineSimilarity;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.OpenMapRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.RAMDirectory;

/**
 * Computes the position in term space of the centroid for each 
 * category during the training phase. During the classify phase,
 * the position in term space of the document to be classified is
 * computed and the cosine similarity of this document with each
 * of the centroids is computed. The category of the centroid which
 * is closest to the document is assigned to the document.
 * @author Sujit Pal
 */
public class LuceneVectorSpaceModelClassifier {

  private final Log log = LogFactory.getLog(getClass());
  
  private String indexDir;
  private String categoryFieldName;
  private String bodyFieldName;

  private Analyzer analyzer = new StandardAnalyzer();
  
  @SuppressWarnings("unchecked") 
  private Transformer<RealMatrix,RealMatrix>[] indexers = new Transformer[] {
    new TfIndexer(),
    new IdfIndexer()
  };

  private AbstractSimilarity similarity = new CosineSimilarity();
  
  private Map<String,RealMatrix> centroidMap;
  private Map<String,Integer> termIdMap;
  private Map<String,Double> similarityMap;
  
  /**
   * Set the directory where the Lucene index is located.
   * @param indexDir the index directory.
   */
  public void setIndexDir(String indexDir) {
    this.indexDir = indexDir;
  }
  
  /**
   * Set the name of the Lucene field containing the preclassified category.
   * @param categoryFieldName the category field name.
   */
  public void setCategoryFieldName(String categoryFieldName) {
    this.categoryFieldName = categoryFieldName;
  }
  
  /**
   * Set the name of the Lucene field containing the document body. The
   * document body must have been indexed with TermVector.YES.
   * @param bodyFieldName the name of the document body field.
   */
  public void setBodyFieldName(String bodyFieldName) {
    this.bodyFieldName = bodyFieldName;
  }

  /**
   * The Analyzer used for tokenizing the document body during indexing,
   * and to tokenize the text to be classified. If not specified, the
   * classifier uses the StandardAnalyzer.
   * @param analyzer the analyzer reference.
   */
  public void setAnalyzer(Analyzer analyzer) {
    this.analyzer = analyzer;
  }
  
  /**
   * A transformer chain of indexers (or normalizers) to normalize the
   * document matrices. If not specified, the default is a chain of TF/IDF.
   * @param indexers the normalizer chain.
   */
  public void setIndexers(Transformer<RealMatrix,RealMatrix>[] indexers) {
    this.indexers = indexers;
  }
  
  /**
   * The Similarity implementation used to calculate the similarity between
   * the text to be classified and the category centroid document matrices.
   * Uses CosineSimilarity if not specified.
   * @param similarity the similarity to use.
   */
  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }

  /**
   * Implements the logic for training the classifier. The input is a Lucene
   * index of preclassified documents. The classifier is provided the name
   * of the field which contains the document body, as well as the name of
   * the field which contains the category name. Additionally, the document
   * body must have had its Term Vectors computed during indexing (using 
   * TermVector.YES). The train method uses the Term Vectors to compute a
   * geometrical centroid for each category in the index. The set of category
   * names to their associated centroid document matrix is available via the
   * getCentroidMap() method after training is complete.
   * @throws Exception if one is thrown.
   */
  public void train() throws Exception {
    train(null);
  }

  /**
   * Slightly specialized version of the classifier train() method that
   * takes a Set of docIds. This is for doing cross-validation. If a null
   * Set of docIds is passed in, then it uses the entire training set.
   * @param docIds a Set of docIds to consider for training. If null, the
   *               all the docIds are considered for training.
   * @throws Exception if thrown.
   */
  public void train(Set<Integer> docIds) throws Exception {
    log.info("Classifier training started");
    IndexReader reader = IndexReader.open(indexDir);
    // Set up a data structure for the term versus the row id in the matrix.
    // This is going to be used for looking up the term's row in the matrix.
    this.termIdMap = computeTermIdMap(reader);
    // Initialize the data structures to hold the td matrices for the various
    // categories.
    Bag<String> docsInCategory = computeDocsInCategory(reader);
    Map<String,Integer> currentDocInCategory = new HashMap<String,Integer>();
    Map<String,RealMatrix> categoryTfMap = new HashMap<String,RealMatrix>();
    for (String category : docsInCategory.uniqueSet()) {
      int numDocsInCategory = docsInCategory.getCount(category);
      categoryTfMap.put(category, 
        new OpenMapRealMatrix(termIdMap.size(), numDocsInCategory));
      currentDocInCategory.put(category, new Integer(0));
    }
    // extract each document body's TermVector into the td matrix for
    // that document's category
    int numDocs = reader.numDocs();
    for (int i = 0; i < numDocs; i++) {
      Document doc = reader.document(i);
      if (docIds != null && docIds.size() > 0) {
        // check to see if the current document is in our training set,
        // and if so, train with it
        if (! docIds.contains(i)) {
          continue;
        }
      }
      String category = doc.get(categoryFieldName);
      RealMatrix tfMatrix = categoryTfMap.get(category);
      // get the term frequency map 
      TermFreqVector vector = reader.getTermFreqVector(i, bodyFieldName);
      String[] terms = vector.getTerms();
      int[] frequencies = vector.getTermFrequencies();
      for (int j = 0; j < terms.length; j++) {
        int row = termIdMap.get(terms[j]);
        int col = currentDocInCategory.get(category);
        tfMatrix.setEntry(row, col, new Double(frequencies[j]));
      }
      incrementCurrentDoc(currentDocInCategory, category);
    }
    reader.close();
    // compute centroid vectors for each category
    this.centroidMap = new HashMap<String,RealMatrix>();
    for (String category : docsInCategory.uniqueSet()) {
      RealMatrix tdmatrix = categoryTfMap.get(category);
      RealMatrix centroid = computeCentroid(tdmatrix);
      centroidMap.put(category, centroid);
    }
    log.info("Classifier training complete");
  }

  /**
   * Returns the centroid map of category name to TD Matrix containing the
   * centroid document of the category. This data is computed as a side
   * effect of the train() method.
   * @return the centroid map computed from the training.
   */
  public Map<String,RealMatrix> getCentroidMap() {
    return centroidMap;
  }
  
  /**
   * Returns the map of analyzed terms versus their positions in the centroid
   * matrices. The data is computed as a side-effect of the train() method.
   * @return a Map of analyzed terms to their position in the matrix.
   */
  public Map<String,Integer> getTermIdMap() {
    return termIdMap;
  }

  /**
   * Once the classifier is trained using the train() method, it creates a
   * Map of category to associated centroid documents for each category, and
   * a termIdMap, which is a mapping of tokenized terms to its row number in
   * the document matrix for the centroid documents. These two structures are
   * used by the classify method to match up terms from the input text with
   * corresponding terms in the centroids to calculate similarities. Builds 
   * a Map of category names and the similarities of the input text to the
   * centroids in each category as a side effect. Returns the category with
   * the highest similarity score, ie the category this text should be 
   * classified under.
   * @param centroids a Map of category names to centroid document matrices.
   * @param termIdMap a Map of terms to their positions in the document matrix.
   * @param text the text to classify.
   * @return the best category for the text.
   * @throws Exception if one is thrown.
   */
  public String classify(Map<String,RealMatrix> centroids, 
      Map<String,Integer> termIdMap, String text) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    indexDocument(ramdir, "text", text);
    // now find the (normalized) term frequency vector for this
    RealMatrix docMatrix = buildMatrixFromIndex(ramdir, "text");
    // compute similarity using passed in Similarity implementation, we
    // use CosineSimilarity by default.
    this.similarityMap = new HashMap<String,Double>();
    for (String category : centroids.keySet()) {
      RealMatrix centroidMatrix = centroids.get(category);
      double sim = similarity.computeSimilarity(docMatrix, centroidMatrix);
      similarityMap.put(category, sim);
    }
    // sort the categories
    List<String> categories = new ArrayList<String>();
    categories.addAll(centroids.keySet());
    Collections.sort(categories, 
      new ReverseComparator<String>(
      new ByValueComparator<String,Double>(similarityMap)));
    // return the best category, the similarity map is also available
    // to the client for debugging or display.
    return categories.get(0);
  }

  /**
   * Returns the map of category to similarity with the document after
   * classification. The similarityMap is computed as a side-effect of
   * the classify() method, so the data is interesting only if this method
   * is called after classify() completes successfully.
   * @return map of category to similarity scores for text to classify.
   */
  public Map<String,Double> getSimilarityMap() {
    return similarityMap;
  }
  
  /**
   * Loops through the IndexReader's TermEnum enumeration, and creates a Map
   * of term to an integer id. This map is going to be used to assign string
   * terms to specific rows in the Term Document Matrix for each category.
   * @param reader a reference to the IndexReader.
   * @return a Map of terms to their integer ids (0-based).
   * @throws Exception if one is thrown.
   */
  private Map<String, Integer> computeTermIdMap(IndexReader reader) throws Exception {
    Map<String,Integer> termIdMap = new HashMap<String,Integer>();
    int id = 0;
    TermEnum termEnum = reader.terms();
    while (termEnum.next()) {
      String term = termEnum.term().text();
      if (termIdMap.containsKey(term)) {
        continue;
      }
      termIdMap.put(term, id);
      id++;
    }
    return termIdMap;
  }

  /**
   * Loops through the specified IndexReader and returns a Bag of categories
   * and their document counts. We don't use the BitSet/DocIdSet approach 
   * here because we don't know how many categories the training documents have 
   * been classified into.
   * @param reader the reference to the IndexReader.
   * @return a Bag of category names and counts.
   * @throws Exception if one is thrown.
   */
  private Bag<String> computeDocsInCategory(IndexReader reader) throws Exception {
    int numDocs = reader.numDocs();
    Bag<String> docsInCategory = new HashBag<String>();
    for (int i = 0; i < numDocs; i++) {
      Document doc = reader.document(i);
      String category = doc.get(categoryFieldName);
      docsInCategory.add(category);
    }
    return docsInCategory;
  }

  /**
   * Increments the counter for the category to point to the next document
   * index. This is used to manage the document index in the td matrix for
   * the category.
   * @param currDocs the Map of category-wise document Id counters.
   * @param category the category whose document-id we want to increment.
   */
  private void incrementCurrentDoc(Map<String,Integer> currDocs, String category) {
    int currentDoc = currDocs.get(category);
    currDocs.put(category, currentDoc + 1);
  }
  
  /**
   * Compute the centroid document from the TD Matrix. Result is a matrix 
   * of term weights but for a single document only.
   * @param tdmatrix
   * @return
   */
  private RealMatrix computeCentroid(RealMatrix tdmatrix) {
    tdmatrix = normalizeWithTfIdf(tdmatrix);
    RealMatrix centroid = new OpenMapRealMatrix(tdmatrix.getRowDimension(), 1);
    int numDocs = tdmatrix.getColumnDimension();
    int numTerms = tdmatrix.getRowDimension();
    for (int row = 0; row < numTerms; row++) {
      double rowSum = 0.0D;
      for (int col = 0; col < numDocs; col++) {
        rowSum += tdmatrix.getEntry(row, col); 
      }
      centroid.setEntry(row, 0, rowSum / ((double) numDocs));
    }
    return centroid;
  }

  /**
   * Builds an in-memory Lucene index using the text supplied for classification.
   * @param ramdir the RAM Directory reference.
   * @param fieldName the field name to index the text as.
   * @param text the text to index.
   * @throws Exception if one is thrown.
   */
  private void indexDocument(RAMDirectory ramdir, String fieldName, String text) 
      throws Exception {
    IndexWriter writer = new IndexWriter(ramdir, analyzer, MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field(fieldName, text, Store.YES, Index.ANALYZED, TermVector.YES));
    writer.addDocument(doc);
    writer.commit();
    writer.close();
  }

  /**
   * Given a Lucene index and a field name with pre-computed TermVectors,
   * creates a document matrix of terms. The document matrix is normalized
   * using the specified indexer chain.
   * @param ramdir the RAM Directory reference.
   * @param fieldName the name of the field to build the matrix from. 
   * @return a normalized Document matrix of terms and frequencies.
   * @throws Exception if one is thrown.
   */
  private RealMatrix buildMatrixFromIndex(RAMDirectory ramdir, String fieldName) 
      throws Exception {
    IndexReader reader = IndexReader.open(ramdir);
    TermFreqVector vector = reader.getTermFreqVector(0, fieldName);
    String[] terms = vector.getTerms();
    int[] frequencies = vector.getTermFrequencies();
    RealMatrix docMatrix = new OpenMapRealMatrix(termIdMap.size(), 1);
    for (int i = 0; i < terms.length; i++) {
      String term = terms[i];
      if (termIdMap.containsKey(term)) {
        int row = termIdMap.get(term);
        docMatrix.setEntry(row, 0, frequencies[i]);
      }
    }
    reader.close();
    // normalize the docMatrix using TF*IDF
    docMatrix = normalizeWithTfIdf(docMatrix);
    return docMatrix;
  }

  /**
   * Pass the input TD Matrix through a chain of transformers to normalize
   * the TD Matrix. Here we do TF/IDF normalization, although it is possible
   * to do other types of normalization (such as LSI) by passing in the
   * appropriate chain of normalizers (or indexers).
   * @param docMatrix the un-normalized TD Matrix.
   * @return the normalized TD Matrix.
   */
  private RealMatrix normalizeWithTfIdf(RealMatrix docMatrix) {
    for (Transformer<RealMatrix,RealMatrix> indexer : indexers) {
      docMatrix = indexer.transform(docMatrix);
    }
    return docMatrix;
  }

  public double crossValidate(int folds, int times) throws Exception {
    IndexReader reader = IndexReader.open(indexDir);
    int numDocs = reader.maxDoc();
    int numDocsPerFold = numDocs / folds;
    Set<String> categories = computeDocsInCategory(reader).uniqueSet();
    Map<String,Integer> categoryPosMap = new HashMap<String,Integer>();
    int pos = 0;
    for (String categoryName : categories) {
      categoryPosMap.put(categoryName, pos);
      pos++;
    }
    int numCats = categories.size();
    RealMatrix confusionMatrix = 
      new Array2DRowRealMatrix(numCats, numCats);
    for (int i = 0; i < times; i++) {
      for (int j = 0; j < folds; j++) {
        reset();
        Map<String,Set<Integer>> partition = new HashMap<String,Set<Integer>>();
        partition.put("test", new HashSet<Integer>());
        partition.put("train", new HashSet<Integer>());
        Set<Integer> testDocs = generateRandomTestDocs(numDocs, numDocsPerFold);
        for (int k = 0; k < numDocs; k++) {
          if (testDocs.contains(k)) {
            partition.get("test").add(k);
          } else {
            partition.get("train").add(k);
          }
        }
        train(partition.get("train"));
        for (int docId : partition.get("test")) {
          Document testDoc = reader.document(docId);
          String actualCategory = testDoc.get(categoryFieldName);
          String body = testDoc.get(bodyFieldName);
          Map<String,RealMatrix> centroidMap = getCentroidMap();
          Map<String,Integer> termIdMap = getTermIdMap();
          String predictedCategory = classify(centroidMap, termIdMap, body);
          // increment the counter for the confusion matrix
          int row = categoryPosMap.get(actualCategory);
          int col = categoryPosMap.get(predictedCategory);
          confusionMatrix.setEntry(row, col, 
            confusionMatrix.getEntry(row, col) + 1); 
        }
      }
    }
    // print confusion matrix
    prettyPrint(confusionMatrix, categoryPosMap);
    // compute accuracy
    double trace = confusionMatrix.getTrace(); // sum of diagnonal elements
    double sum = 0.0D;
    for (int i = 0; i < confusionMatrix.getRowDimension(); i++) {
      for (int j = 0; j < confusionMatrix.getColumnDimension(); j++) {
        sum += confusionMatrix.getEntry(i, j);
      }
    }
    double accuracy = trace / sum;
    return accuracy;
  }

  private Set<Integer> generateRandomTestDocs(int numDocs, int numDocsPerFold) {
    Set<Integer> docs = new HashSet<Integer>();
    while (docs.size() < numDocsPerFold) {
      docs.add((int) (numDocs * Math.random() - 1));
    }
    return docs;
  }

  private void prettyPrint(RealMatrix confusionMatrix, 
      Map<String,Integer> categoryPosMap) {
    System.out.println("==== Confusion Matrix ====");
    // invert the map and write the header
    System.out.printf("%10s", " ");
    Map<Integer,String> posCategoryMap = new HashMap<Integer,String>();
    for (String category : categoryPosMap.keySet()) {
      posCategoryMap.put(categoryPosMap.get(category), category);
      System.out.printf("%8s", category);
    }
    System.out.printf("%n");
    for (int i = 0; i < confusionMatrix.getRowDimension(); i++) {
      System.out.printf("%10s", posCategoryMap.get(i));
      for (int j = 0; j < confusionMatrix.getColumnDimension(); j++) {
        System.out.printf("%8d", (int) confusionMatrix.getEntry(i, j));
      }
      System.out.printf("%n");
    }
  }

  /**
   * Reset internal data structures. Used by the cross validation process
   * to reset the internal state of the classifier between runs.
   */
  private void reset() {
    if (centroidMap != null) {
      centroidMap.clear();
    }
    if (termIdMap != null) {
      termIdMap.clear();
    }
    if (similarityMap != null) {
      similarityMap.clear();
    }
  }
}

And here is the test code - this just contains the test code to run the cross validation. The full code is available in the jtmt project repository.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// Source: src/test/java/net/sf/jtmt/classifiers/LuceneVectorSpaceModelClassifierTest.java
package net.sf.jtmt.classifiers;

public class LuceneVectorSpaceModelClassifierTest {

  @SuppressWarnings("unchecked")
  @Test
  public void testCrossValidate() throws Exception {
    LuceneVectorSpaceModelClassifier classifier = 
      new LuceneVectorSpaceModelClassifier();
    // setup
    classifier.setIndexDir(INDEX_DIR);
    classifier.setAnalyzer(new SummaryAnalyzer());
    classifier.setCategoryFieldName("category");
    classifier.setBodyFieldName("body");
    // this is the default but we set it anyway, to illustrate usage
    classifier.setIndexers(new Transformer[] {
      new TfIndexer(),
      new IdfIndexer()
    });
    // this is the default but we set it anyway, to illustrate usage.
    // Similarity need not be set before training, it can be set before
    // the classification step.
    classifier.setSimilarity(new CosineSimilarity());
    double accuracy = classifier.crossValidate(10, 10);
    System.out.println("accuracy=" + accuracy);
  }
}

Results

The confusion matrix for this run is shown below. As you can see, it does quite well for cocoa documents, not so well for coffee, and very poorly indeed for sugar documents.

1
2
3
4
5
==== Confusion Matrix ====
             cocoa  coffee   sugar
     cocoa     147       0       0
    coffee     129      23       0
     sugar     179      17       5

Based on the above matrix, the accuracy of the classifier is (147+23+5)/(147+129+23+179+17+5) = 0.35 (i.e. it is expected to classify correctly about 35% of the time). Looks like there may be lots of room for improvement.

Be the first to comment. Comments are moderated to prevent spam.