Saturday, February 03, 2018

Using Snorkel Probabilistic Labels for Classification


Last week, I wrote about using the Snorkel Generative model to convert noisy labels to an array of marginal probabilities for the label being in each class. This week, I will describe the second part of the experiment, where I use these probabilistic labels to train a Discriminative model such as a Classifier. As a reminder, the standard pipeline for a Snorkel use-case looks like the diagram shown below.


Noisy labels can be generated in a variety of ways, such as weak supervision through the use of labeling functions, distant supervision through reference ontologies, unsupervised models, or predictions from weaker models. These labels may overlap or conflict with other labels. So assuming N labeling functions, we would start with N noisy labels per input record. Assuming that we want to build a k-class classifier, the labels would need to be 1 of k classes, the cardinality of the Generative model would be k, and the output of the generative model would be an array of size k for each input record. These k numbers represent the probability of the record being in the corresponding class, and add up to 1.
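
To make the shapes concrete, here is a minimal sketch of turning N noisy labels for one record into a k-element vector that sums to 1. It uses plain vote fractions as a naive stand-in and is not what the Snorkel Generative model does internally; the function name `vote_fractions` and the convention that 0 means "abstain" are assumptions made for illustration.

```python
import numpy as np

def vote_fractions(noisy_labels, k):
    """Turn one record's noisy labels (ints in 1..k, 0 = abstain) into a
    length-k vector of vote fractions. A naive stand-in, NOT Snorkel's model."""
    votes = np.array([l for l in noisy_labels if l != 0], dtype=int)
    counts = np.bincount(votes - 1, minlength=k).astype(float)
    if counts.sum() == 0:
        # every labeler abstained: fall back to a uniform distribution
        return np.full(k, 1.0 / k)
    return counts / counts.sum()

# five hypothetical labelers voting on a 3-class problem, one abstaining
noisy = [1, 1, 3, 0, 1]
probs = vote_fractions(noisy, k=3)
print(probs, probs.sum())
```

The Generative model additionally learns how accurate each labeling source is, so its marginals will generally differ from simple vote fractions.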

The next step is to train a noise-aware discriminative model, using the original data and these probabilistic labels. A noise-aware discriminative model uses a noise-aware loss function, which is just the expected loss with respect to the noisy training set model. This turns out to just be the cross-entropy between the label and prediction (see this blog post on Data Programming with Tensorflow by the Snorkel team for the derivation for a binary classification model, but I think it can be easily extended to a k-class classification scenario as well).
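
The k-class extension can be checked numerically. The sketch below, using made-up vectors `q` and `p`, averages the ordinary one-hot loss over every possible class, weighted by the marginal probability of that class, and compares the result with the cross-entropy computed directly against the probabilistic label. The helper names are mine and are not part of Snorkel or Keras.

```python
import numpy as np

def cross_entropy(y, p):
    """Cross-entropy between a label distribution y and a prediction p."""
    return float(-np.sum(y * np.log(p)))

def expected_hard_label_loss(q, p):
    """Average the one-hot loss over every possible class,
    weighting class c by its marginal probability q[c]."""
    k = len(q)
    one_hots = np.eye(k)
    return float(sum(q[c] * cross_entropy(one_hots[c], p) for c in range(k)))

q = np.array([0.6, 0.3, 0.1])     # hypothetical probabilistic label
p = np.array([0.5, 0.25, 0.25])   # hypothetical model prediction

# the two numbers agree, because cross-entropy is linear in the label
print(expected_hard_label_loss(q, p), cross_entropy(q, p))
```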

As an example, consider a 3-class probabilistic label ytrue_prob and the corresponding categorical label ytrue_cat. We can compute the cross-entropy between either of these label vectors and a prediction vector ypred in exactly the same way, using the Keras loss function categorical_crossentropy. So we can use the probabilistic labels generated by the Snorkel Generative model in exactly the same way as we would use real labels (converted to categorical form).

from keras.losses import categorical_crossentropy
import keras.backend as K
import numpy as np

with K.get_session() as sess:
    
    ytrue_prob = K.constant(np.array([0.9, 0.03, 0.07]))
    ytrue_cat = K.constant(np.array([1., 0., 0.]))

    ypred = K.constant(np.array([0.7, 0.15, 0.15])) 
    
    loss_cat = categorical_crossentropy(ytrue_cat, ypred)
    loss_prob = categorical_crossentropy(ytrue_prob, ypred)
    
    loss_cat_val, loss_prob_val = sess.run([loss_cat, loss_prob])
    print(loss_cat_val, loss_prob_val)

The two output values are different, but both are floating-point scalars, and in either case the objective during training is to minimize that value. I decided to test this intuition by training a network on the probabilistic labels and comparing its performance with that of the same network trained on the corresponding categorical labels. The data is the same as the one I used to train the Snorkel Generative model, from the Snorkel crowdsourcing example.

from keras import regularizers
from keras.callbacks import ModelCheckpoint
from keras.layers import Input
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.embeddings import Embedding
from keras.models import Model, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import collections
import nltk
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

# constants
DATA_DIR = "data"

CLEAN_LABELS_FILE = os.path.join(DATA_DIR, "train-clean-labels.csv")
NOISY_LABELS_FILE = os.path.join(DATA_DIR, "train-noisy-labels.csv")
LABEL_LOOKUP_FILE = os.path.join(DATA_DIR, "label-lookup.csv")

BEST_MODEL_P = os.path.join(DATA_DIR, "disc-model-p-best.h5")
FINAL_MODEL_P = os.path.join(DATA_DIR, "disc-model-p-final.h5")

BEST_MODEL_C = os.path.join(DATA_DIR, "disc-model-c-best.h5")
FINAL_MODEL_C = os.path.join(DATA_DIR, "disc-model-c-final.h5")

# extract data
noisy_df = pd.read_csv(NOISY_LABELS_FILE)
clean_df = pd.read_csv(CLEAN_LABELS_FILE)
data_df = noisy_df.join(clean_df.set_index("tweet_id"), 
                        how="inner", on="tweet_id", rsuffix="_r")
data_df = data_df.loc[:, ["tweet_id", "tweet_body", 
                          "cls_1", "cls_2", "cls_3", "cls_4", "cls_5",
                          "sentiment"]]

data, prob_labels, true_labels = [], [], []
max_num_words = 0
word_counts = collections.Counter()
for row in data_df.values:
    # read tweet, normalize, tokenize and collect word counts
    words = [word.lower() for word in nltk.word_tokenize(row[1])
                          if not word.startswith("@")]
    if max_num_words < len(words):
        max_num_words = len(words)
    for word in words:
        word_counts[word] += 1
    data.append(" ".join(words))
    prob_labels.append(row[2:7])
    true_labels.append(row[7])

# constants derived from data after exploratory analysis
num_recs = len(data)
max_len = 30
vocab_size = 1300
num_classes = len(prob_labels[0])

# convert data to matrices
X = np.zeros((num_recs, max_len))
Yp = np.zeros((num_recs, num_classes))
Yc = np.zeros((num_recs, num_classes))

for i, (tweet, prob_label, true_label) in enumerate(zip(data, prob_labels, true_labels)):
    X[i] = np.array(pad_sequences([one_hot(tweet, vocab_size, split=" ")], 
                                  maxlen=max_len))
    Yp[i] = np.array(prob_label)
    Yc[i] = to_categorical(true_label-1, num_classes=num_classes)
    
Xtrain, Xtest, Yptrain, Yptest, Yctrain, Yctest = train_test_split(X, Yp, Yc,
    train_size=0.7, test_size=0.3, random_state=42)

This gives us two sets of data: a training set with 700 records and a test set with 300 records. Each tweet is represented by an integer sequence of size 30. During exploratory analysis, we found that the maximum length of a tweet was 39 words, and that the vocabulary contained 3,781 unique words, of which 1,295 occurred more than once. So we decided to cut our vocabulary size to 1300. Since we are using the Keras one_hot function, which uses the hashing trick, our vocabulary of 3,781 words is projected onto 1300 positions. We also pad shorter sentences to 30 words (longer ones are truncated), so we need an additional PAD character (0).
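
For readers unfamiliar with the hashing trick, here is a rough pure-Python sketch of the idea. It is not the Keras implementation: the helper names `hashed_index` and `encode` are hypothetical, and md5 is used only so that the output is stable across runs. As far as I understand it, Keras one_hot assigns indices between 1 and n-1 and pad_sequences pads and truncates at the front by default, so the sketch does the same.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an index in 1..n-1; 0 is left free for padding.
    md5 is an illustrative choice that keeps results stable across runs."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text, n, max_len):
    """Hash each whitespace-separated token, keep at most the last
    max_len tokens, and left-pad with 0 to a fixed length."""
    ids = [hashed_index(w, n) for w in text.split()]
    ids = ids[-max_len:]
    return [0] * (max_len - len(ids)) + ids

seq = encode("the weather is great today", n=1300, max_len=30)
print(seq)
```

Because 3,781 distinct words are squeezed into fewer than 1300 slots, some unrelated words necessarily share an index. That is the price paid for not having to build and store an explicit vocabulary.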

The Yp* arrays contain the probabilistic labels. Since we are looking at 5 classes, each row has 5 columns. The Yc* arrays project the categorical label onto a 1-hot space, so each row of these matrices also has 5 columns.
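
As a small illustration of why the two label matrices are interchangeable as training targets, the sketch below builds a one-hot row from a 1-based sentiment label and sets it next to a made-up probabilistic row. The helper `one_hot_label` and the numbers are hypothetical; the post itself uses the Keras to_categorical function for this step.

```python
import numpy as np

def one_hot_label(label, num_classes):
    """Convert a 1-based class label into a one-hot row vector."""
    row = np.zeros(num_classes)
    row[label - 1] = 1.0
    return row

yc = one_hot_label(4, 5)                        # categorical label for class 4
yp = np.array([0.05, 0.05, 0.10, 0.70, 0.10])   # hypothetical marginals

# same shape, both sum to 1, and here both put the most mass on class 4
print(yc, yp)
```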

Our objective is to train two identical networks, one using the probabilistic labels and one with the categorical labels and evaluate them. Below we define some functions that we will reuse across the two networks.

def build_model():
    seq_input = Input(shape=(max_len,), dtype="int32")
    x = Embedding(vocab_size + 1, 100, input_length=max_len)(seq_input)
    x = Flatten()(x)
    x = Dense(64, activation="relu")(x)
    preds = Dense(num_classes, activation="softmax")(x)
    model = Model(inputs=[seq_input], outputs=[preds])
    return model

def compile_model(model):
    model.compile(loss="categorical_crossentropy", optimizer="adam", 
                metrics=["acc"])
    return model

def fit_model(model, best_model_file, Xtrain, Ytrain):
    checkpoint = ModelCheckpoint(filepath=best_model_file, save_best_only=True)
    history = model.fit(Xtrain, Ytrain, validation_split=0.1, 
                        epochs=10, batch_size=64,
                        callbacks=[checkpoint])
    return history

def eval_report(title, Ytest, Ytest_):
    ytest = np.argmax(Ytest, axis=1)
    ytest_ = np.argmax(Ytest_, axis=1)
    acc = accuracy_score(ytest, ytest_)
    cm = confusion_matrix(ytest, ytest_)
    print("\n*** {:s}".format(title.upper()))
    print("accuracy: {:.3f}".format(acc))
    print("confusion matrix")
    print(cm)

We use a very simple neural network to build a word embedding from our integer sequence - each word ends up getting represented by a vector of size 100. This tensor is then flattened and sent through two Dense layers. The network structure is shown below.
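
The tensor sizes and parameter counts implied by build_model can be worked out by hand. The sketch below does the arithmetic, assuming the usual counts of one weight per (input, output) pair plus one bias per output for a Dense layer, and one vector per index for an Embedding layer. The variable names are mine.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

max_len, vocab_size, embed_dim = 30, 1300, 100
hidden_units, num_classes = 64, 5

embedding_params = (vocab_size + 1) * embed_dim            # one vector per index
flattened_size = max_len * embed_dim                       # 30 words x 100 dims
hidden_params = dense_params(flattened_size, hidden_units)
output_params = dense_params(hidden_units, num_classes)
total_params = embedding_params + hidden_params + output_params

print(embedding_params, flattened_size, hidden_params, output_params, total_params)
# prints: 130100 3000 192064 325 322489
```

Most of the roughly 322K parameters sit in the embedding and the first Dense layer, which is a lot of capacity for 700 training records.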


We then train the first network using the probabilistic labels and the second one using the categorical labels, and evaluate them.

model_p = build_model()
model_p = compile_model(model_p)
fit_model(model_p, BEST_MODEL_P, Xtrain, Yptrain)
best_model_p = load_model(BEST_MODEL_P)
Yptest_ = best_model_p.predict(Xtest)
eval_report("probabilistic", Yptest, Yptest_)

model_c = build_model()
model_c = compile_model(model_c)
fit_model(model_c, BEST_MODEL_C, Xtrain, Yctrain)
best_model_c = load_model(BEST_MODEL_C)
Yctest_ = best_model_c.predict(Xtest)
eval_report("categorical", Yctest, Yctest_)

In our results, the network trained on categorical labels did slightly better (accuracy: 0.927) than the one trained on probabilistic labels (accuracy: 0.923). This kind of makes sense, since the categorical labels are created (presumably at great expense) by humans, while the probabilistic labels are generated from less expert crowdsourced workers in this case, and by cheaper automatic methods in general. However, at least for this dataset, the difference in performance is very small.
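
One detail worth keeping in mind when reading these numbers: eval_report scores the probabilistic network against the argmax of Yptest, whereas the categorical network is scored against Yctest, so the two accuracies are measured against different references. A possible variation, sketched below with toy arrays standing in for the real matrices, would be to score both networks against the clean labels. The function name `argmax_accuracy` and all values here are hypothetical.

```python
import numpy as np

def argmax_accuracy(Y_ref, Y_pred):
    """Fraction of rows where the argmax class of Y_pred matches Y_ref."""
    return float(np.mean(np.argmax(Y_ref, axis=1) == np.argmax(Y_pred, axis=1)))

# toy stand-ins for Yptest (probabilistic), Yctest (clean) and Yptest_ (predictions)
Yp_ref = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])
Yc_ref = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]])
Y_pred = np.array([[0.8, 0.1, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]])

print("vs probabilistic labels:", argmax_accuracy(Yp_ref, Y_pred))
print("vs clean labels:        ", argmax_accuracy(Yc_ref, Y_pred))
```

In the post's own code, this would amount to also calling eval_report with Yctest and Yptest_.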

I thought this was particularly encouraging for use cases around deep learning, which typically need large amounts of training data, and which use categorical labels. Generating noisy labels and cleaning them up using the Snorkel Generative model seems to be a good approach to getting large amounts of usable labeled data for classification.


2 comments (moderated to prevent spam):

disdi said...

Hi Sujit,

Could you share the code you used in both your articles about Snorkel.
It would be really helpful.

Sujit Pal said...

Hi disdi, I have shared the relevant code in the blog posts themselves. There are no LFs here because the "features" are the annotations given by 1 of 20 crowdsourced workers.