This analysis compares different text representations with simple sklearn linear and ensemble classifiers. The basis of all the text representations tested in this analysis is the sklearn vectorizer. Another, more common approach to representing text is based on word embeddings, but that is out of scope for this post and will be covered in a subsequent post.

The basic sklearn vectorizers are the focus of this analysis because, in previous tests, their baseline performance was comparable to state-of-the-art text representations using BERT embeddings. Since no experimentation had been done on the input representation fed to these basic linear classifiers, and their performance was already comparable with state-of-the-art methods, the simple sklearn classifiers are revisited here and tested with different variations of the input text representation, to see whether any substantial improvements can be obtained.

Text Pre-Processing and Vectorization

Several variations of each phase of the text processing pipeline were tested in this experiment, as explained in the following sections. A high-level sketch of how the phases fit together follows the list below.

The main phases are:

  • Text Cleaning
  • Text Tokenization
  • N-grams
  • Text Vectorizer
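
To make the phases concrete, here is a minimal sketch of how they can chain together in an sklearn pipeline. The class choices and parameters are illustrative placeholders, not the exact configuration used in these experiments.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Cleaning, tokenization, and n-gram extraction all happen inside (or just before)
# the vectorizer; the classifier then consumes the resulting sparse vectors.
pipeline = Pipeline([
    ("vect", TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("clf", LogisticRegressionCV(max_iter=1000)),
])
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```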

Text Cleaning

There are numerous variations of what can be done during the text cleaning phase, such as removing certain terms, numbers, or punctuation, lemmatizing words, lower-casing words, etc. The first step applied in all cases is lower-casing all words, to reduce the vocabulary size and noisiness. Several variations of subsequent cleaning steps were then applied (a small sketch follows the list):

  • None: input text remains as is
  • Removing numbers: numeric values often carry little meaning and mostly add noise, e.g. the numbers 1, 5000, 35, etc.; each number mentioned in the text becomes a new term in the vocabulary.
  • Removing numbers and punctuation: in this variation the only characters remaining are letters, so the vocabulary is much cleaner, made up only of words. However, it is possible that some of the meaning carried by the numbers and punctuation is lost.
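
The following is a minimal sketch of these cleaning variants, using a small regex-based helper (clean_text and its mode names are hypothetical, chosen only to mirror the variants above):

```python
import re

def clean_text(text, mode="none"):
    """Apply one of the cleaning variants; lower-casing is always done first."""
    text = text.lower()
    if mode == "no_numbers":
        text = re.sub(r"\d+", " ", text)        # drop numeric tokens such as 1, 5000, 35
    elif mode == "only_alpha":
        text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse the extra whitespace

print(clean_text("Order 5000 units, ASAP!", mode="only_alpha"))  # -> "order units asap"
```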

Text Tokenization

The variations of text cleaning previously explained were all used with the BERT tokenization method described below. Other tokenization variants were also used, including the simplest word-based approach built into the sklearn vectorizer, as well as its character-based analyzer option. A sketch of plugging a custom tokenizer into the vectorizer follows the list.

  • BERT-based method: a custom tokenization procedure, the same one used to prepare input for the BERT classifiers. BERT uses WordPiece, a subword tokenization method that breaks rare words into smaller units (similar in spirit to byte-pair encoding). The steps followed in this procedure are:
    • Word tokenize it (i.e. “susie is calling” -> [“susie”, “is”, “calling”])
    • Break words into WordPieces (i.e. “calling” -> [“call”, “##ing”])
    • Add special “CLS” and “SEP” tokens
    • Append “index” and “segment” tokens to each input
  • Word Tokenization: basic tokenization based on white spaces between words
  • Character based tokenization: each character is considered an individual unit, and tokens are made up of character n-grams, as will be described in the following section
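
As an illustration of how such a tokenizer can be plugged into the sklearn vectorizer, the sketch below uses the Hugging Face transformers WordPiece tokenizer as a stand-in for the custom BERT tokenization step (an assumption; the exact tokenization code used in the experiments is not shown here):

```python
from transformers import BertTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("susie is calling"))  # WordPiece tokens; rare words split into "##"-prefixed pieces

# Hand the WordPiece tokenizer to the vectorizer in place of its default word splitter.
# (The special [CLS]/[SEP] and segment ids matter for BERT itself, not for a bag-of-n-grams vectorizer.)
vectorizer = TfidfVectorizer(
    tokenizer=bert_tokenizer.tokenize,
    lowercase=False,          # the uncased BERT tokenizer already lower-cases
    ngram_range=(1, 1),
)
```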

N-grams

N-grams determine how many adjacent tokens are grouped into a single unit to count in the document. For example, the character-based 3-grams for “good morning” are: “goo”, “ood”, “od ”, “d m”, “ mo”, “mor”, etc., one for each window of 3 adjacent characters. For sequences of words, the 3-grams for the phrase “the dog ran up the hill” are “the dog ran”, “dog ran up”, “ran up the”, etc. Any value of n can be chosen as the n-gram size to consider. The settings chosen for these experiments were:

  • BERT-based tokens: unigrams only, and unigrams + bigrams + trigrams
  • word tokens: unigrams only, and unigrams + bigrams + trigrams
  • character based: n-grams from three to ten characters

Note that increasing the n-gram range greatly increases the size of the vocabulary, so this growth must be mitigated, for example by setting a maximum vocabulary size and by excluding vocabulary terms that are too rare (most terms will be rare, due to Zipf’s law), as shown in the sketch below.
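
A sketch of how these ranges map onto sklearn’s ngram_range parameter (the vocabulary cap shown is an illustrative mitigation, not a value reported in this post):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word-token variants: unigrams only, or unigrams + bigrams + trigrams.
word_unigrams = CountVectorizer(analyzer="word", ngram_range=(1, 1))
word_trigrams = CountVectorizer(analyzer="word", ngram_range=(1, 3))

# Character-based variant: n-grams of three to ten characters.
# A vocabulary cap is one way to keep the feature space from exploding.
char_ngrams = CountVectorizer(
    analyzer="char",
    ngram_range=(3, 10),
    max_features=100_000,  # illustrative cap
)
```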

Text Vectorizers

There are multiple vectorizers available in sklearn, including:

  • Count vectors: each text is translated into a vector directly based on the counts of each term in the vocabulary. For example, with a vocabulary vector of [dog, runs, the, up, hill, good], the text “The dog runs up the hill. The dog is good.” would be transformed into the vector: [dog:2, runs:1, the:3, up:1, hill:1, good:1].
  • term-frequency inverse-document-frequency (tfidf): the product of two statistics, term frequency and inverse document frequency, which has the effect of weighting terms by importance. Note how the word “the” received a high count in the previous example, even though it contributes little to the meaning of the content. Tfidf accounts for this: because “the” occurs at high frequency across all documents, its inverse document frequency is low, which proportionally decreases its weight. Both vectorizers are sketched below.
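
A minimal sketch of the two vectorizers on a small two-document example (the second document is added only so that the inverse document frequency is not trivial):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The dog runs up the hill. The dog is good.",
    "The cat sleeps on the hill.",
]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
print(dict(zip(count_vec.get_feature_names_out(), X_counts.toarray()[0])))
# first document -> raw counts, e.g. "the": 3, "dog": 2, "runs": 1, ...

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)
# "the" and "hill" occur in both documents, so their inverse document frequency
# (and therefore their tfidf weight) is lower than that of document-specific terms.
```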

Additional settings used in the vectorizer include:

  • minimum frequency: the minimum frequency of words or n-grams. As mentioned, a large number of tokens and n-grams occur only once or a few times across the whole corpus. This greatly expands the size of the vocabulary, especially when a wide n-gram range is considered, so one way to limit it is to set a restriction in the vectorizer that only keeps terms appearing a minimum number of times. For this set of experiments the value was set at 10.
  • maximum frequency: a way of limiting the vocabulary to reduce the number of stop words. Instead of using a built-in stop-word list, we can simply restrict the maximum allowable frequency of terms. Terms that appear in (nearly) every document are excluded: they have little impact on the classifier anyway, especially given the tfidf weighting, and their weights would mostly contribute noise and increase the dimensionality unnecessarily. Both settings appear in the sketch below.
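
A sketch of these two settings as sklearn parameters (min_df=10 follows the value stated above; the max_df threshold of 0.95 is an illustrative choice, since the exact value is not given here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),
    min_df=10,     # drop terms that appear in fewer than 10 documents
    max_df=0.95,   # drop terms in more than 95% of documents (corpus-driven stop words)
)
```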

Analysis by Classifier and Representations

The results from testing all the representations previously described are summarized in the following tables. The vectorizers are named according to the n-gram setting (1gram or 3gram), whether they were character based (“char”), and, for the custom BERT-method tokenization (“cust”), the pre-processing applied: all characters kept (“all”), numbers removed (“no_nums”), or only letter characters kept (“only_alpha”).

Two linear classifiers were tested, logistic regression and a support vector machine (SVM), along with an ensemble method (random forest) for comparison. A balanced cost function was used for all of them, and an unbalanced variant of logistic regression was also included.
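
A sketch of a plausible classifier set-up matching the names in the tables below; the exact sklearn classes and hyperparameters are assumptions (e.g. LinearSVC standing in for the SVM):

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "logregcv": LogisticRegressionCV(max_iter=1000),
    "logregcv_balanced": LogisticRegressionCV(class_weight="balanced", max_iter=1000),
    "svm_balanced": LinearSVC(class_weight="balanced"),
    "random_forest_balanced": RandomForestClassifier(class_weight="balanced"),
}
# Each classifier is trained and evaluated on each vectorizer variant.
```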

classifier  vectorizer  count  mean  std  min  25%  50%  75%  max
logregcv 1gram 21.0 0.387130 0.291712 0.000000 0.216216 0.341463 0.714286 0.835821
3gram 21.0 0.404456 0.291235 0.000000 0.117647 0.418605 0.690909 0.823529
char 21.0 0.415545 0.289436 0.000000 0.227273 0.419753 0.690909 0.869565
cust_all-1gram 21.0 0.376788 0.316743 0.000000 0.050000 0.341463 0.742515 0.865672
cust_all-3gram 21.0 0.379196 0.301674 0.000000 0.121212 0.305085 0.750000 0.882353
cust_no_nums-1gram 21.0 0.375641 0.316426 0.000000 0.050000 0.341463 0.750751 0.865672
cust_no_nums-3gram 21.0 0.393199 0.302940 0.000000 0.121212 0.392157 0.750000 0.882353
cust_only_alpha-1gram 21.0 0.377285 0.308369 0.000000 0.171429 0.333333 0.750000 0.852941
cust_only_alpha-3gram 21.0 0.401743 0.295126 0.000000 0.121212 0.435294 0.727273 0.848485
logregcv_balanced 1gram 21.0 0.495358 0.210163 0.062500 0.380952 0.488889 0.695652 0.821918
3gram 21.0 0.508967 0.212936 0.074074 0.380952 0.483871 0.718750 0.810811
char 21.0 0.521330 0.182913 0.190476 0.379310 0.502415 0.666667 0.857143
cust_all-1gram 21.0 0.491233 0.214054 0.064516 0.352941 0.497653 0.707692 0.833333
cust_all-3gram 21.0 0.499426 0.209599 0.083333 0.333333 0.481132 0.709677 0.837838
cust_no_nums-1gram 21.0 0.497385 0.212544 0.066667 0.366197 0.461538 0.730159 0.861111
cust_no_nums-3gram 21.0 0.499315 0.208580 0.086957 0.347826 0.479592 0.700000 0.873239
cust_only_alpha-1gram 21.0 0.492733 0.217237 0.076923 0.322581 0.478873 0.730159 0.849315
cust_only_alpha-3gram 21.0 0.510576 0.208496 0.074074 0.416667 0.480392 0.730159 0.837838
random_forest_balanced 1gram 21.0 0.311567 0.293623 0.000000 0.090909 0.187500 0.701754 0.779661
3gram 21.0 0.340907 0.266239 0.000000 0.146341 0.227273 0.666667 0.779661
char 21.0 0.365583 0.263516 0.000000 0.181818 0.251852 0.689655 0.800000
cust_all-1gram 21.0 0.312634 0.283756 0.000000 0.125000 0.174603 0.689655 0.779661
cust_all-3gram 21.0 0.331651 0.272659 0.000000 0.111111 0.229008 0.654545 0.779661
cust_no_nums-1gram 21.0 0.302543 0.291940 0.000000 0.097561 0.160000 0.701754 0.779661
cust_no_nums-3gram 21.0 0.332594 0.269699 0.000000 0.115702 0.227273 0.654545 0.779661
cust_only_alpha-1gram 21.0 0.311563 0.293241 0.000000 0.090909 0.181818 0.701754 0.779661
cust_only_alpha-3gram 21.0 0.331593 0.282147 0.000000 0.121212 0.227273 0.666667 0.779661
svm_balanced 1gram 21.0 0.528486 0.191674 0.000000 0.454545 0.522727 0.702703 0.842105
3gram 21.0 0.527106 0.190601 0.062500 0.428571 0.500000 0.687500 0.853333
char 21.0 0.537184 0.193380 0.000000 0.444444 0.523364 0.718750 0.833333
cust_all-1gram 21.0 0.517229 0.186578 0.060606 0.413793 0.526718 0.649351 0.820513
cust_all-3gram 21.0 0.527482 0.183058 0.080000 0.444444 0.513274 0.696970 0.794872
cust_no_nums-1gram 21.0 0.517215 0.184749 0.064516 0.413793 0.514286 0.649351 0.810127
cust_no_nums-3gram 21.0 0.521650 0.188471 0.080000 0.454545 0.500000 0.686567 0.810127
cust_only_alpha-1gram 21.0 0.523520 0.181054 0.066667 0.423529 0.529412 0.657895 0.810127
cust_only_alpha-3gram 21.0 0.527546 0.190969 0.071429 0.437500 0.512821 0.707692 0.810127

Analysis of Classifier on Full Datasets

Sentence Based

classifier  count  mean  std  min  25%  50%  75%  max
logregcv 27.0 0.538240 0.026181 0.502447 0.521963 0.528771 0.563489 0.583333
logregcv_balanced 27.0 0.560509 0.032042 0.504496 0.532829 0.565003 0.591144 0.606166
random_forest_balanced 27.0 0.506861 0.008723 0.489726 0.502209 0.505082 0.509686 0.532418
svm_balanced 27.0 0.552539 0.026245 0.511447 0.536557 0.550852 0.573312 0.607453

Excerpt Based

classifier  count  mean  std  min  25%  50%  75%  max
logregcv 27.0 0.606388 0.029977 0.548485 0.592954 0.600801 0.624813 0.658854
logregcv_balanced 27.0 0.616874 0.023957 0.576471 0.600716 0.612943 0.629839 0.661818
random_forest_balanced 27.0 0.470446 0.021622 0.440141 0.454623 0.464883 0.489271 0.507993
svm_balanced 27.0 0.614373 0.028834 0.566914 0.590567 0.618690 0.629487 0.670190

Analysis of Representation on Full datasets

Sentence Based

vectorizer  count  mean  std  min  25%  50%  75%  max
1gram 12.0 0.523218 0.014292 0.502758 0.512716 0.524475 0.531200 0.546624
3gram 12.0 0.551585 0.032038 0.504334 0.528432 0.556689 0.574398 0.593997
char 12.0 0.570441 0.031584 0.518699 0.548608 0.579456 0.590890 0.607453
cust_all-1gram 12.0 0.517531 0.016412 0.496622 0.502655 0.517321 0.532072 0.542048
cust_all-3gram 12.0 0.548571 0.034563 0.498834 0.518188 0.558592 0.578283 0.590374
cust_no_nums-1gram 12.0 0.519189 0.017042 0.489726 0.505752 0.520005 0.534048 0.540292
cust_no_nums-3gram 12.0 0.550599 0.033735 0.505082 0.518338 0.557842 0.578534 0.593817
cust_only_alpha-1gram 12.0 0.518241 0.013186 0.501186 0.507472 0.517567 0.526971 0.537879
cust_only_alpha-3gram 12.0 0.556459 0.034341 0.506245 0.526145 0.568348 0.578001 0.602856

Excerpt Based

vectorizer  count  mean  std  min  25%  50%  75%  max
1gram 12.0 0.556811 0.060908 0.440141 0.540708 0.577642 0.601289 0.606838
3gram 12.0 0.587291 0.075564 0.451049 0.563028 0.620402 0.637051 0.658854
char 12.0 0.600079 0.070327 0.469178 0.579143 0.620645 0.648738 0.670190
cust_all-1gram 12.0 0.560850 0.062405 0.440141 0.547690 0.587992 0.597148 0.623542
cust_all-3gram 12.0 0.589742 0.072032 0.457539 0.576271 0.618028 0.633470 0.653504
cust_no_nums-1gram 12.0 0.555949 0.060072 0.445614 0.536471 0.584496 0.596083 0.604692
cust_no_nums-3gram 12.0 0.589490 0.070813 0.457539 0.577209 0.620488 0.632463 0.648438
cust_only_alpha-1gram 12.0 0.557290 0.060581 0.440141 0.535464 0.579711 0.595053 0.622642
cust_only_alpha-3gram 12.0 0.595678 0.071185 0.459413 0.578810 0.623086 0.639356 0.661433

Summary of Performance Results

The most notable effect of the various representations on performance was that the character-based representation consistently performed best. However, the differences between representation methods were almost negligible, considering the amount of variation between folds in the 3-fold cross-validation.

Comparing classifiers, the linear methods clearly perform best overall, with the balanced SVM slightly ahead in most cases.