Comparison of TF-IDF Text Representations
This analysis compares different text representations using simple sklearn linear and ensemble classifiers. All of the text representations tested here are built on the sklearn vectorizers. Another common approach to representing text is word embeddings, but that is out of scope for this post and will be covered in a subsequent post.
The sklearn vectorizers are the focus of this analysis because, in previous tests, their baseline performance was comparable to state-of-the-art text representations built on BERT embeddings. Since no experimentation was previously done on the input representation fed to these simple classifiers, and their performance was still comparable to state-of-the-art methods, they are revisited here and tested with different variations of the input text representation to see whether any substantial improvements can be obtained.
Text Pre-Processing and Vectorization
There were several variations of each phase of the text processing pipeline tested in this experiment, as will be explained in the following sections.
The main phases are:
- Text Cleaning
- Text Tokenization
- N-grams
- Text Vectorizers
Text Cleaning
There are numerous things that can be done during the text cleaning phase, such as removing certain terms, numbers, or punctuation, lemmatizing words, lower-casing words, etc. The first step applied in all cases is lower-casing all words, which reduces the vocabulary size and noise. Several variations of the subsequent cleaning steps were then applied, as sketched in the code after this list:
- None: input text remains as is
- Removing numbers: numeric values often do not carry significant meaning and only add noise, e.g. the numbers 1, 5000, 35, etc.; each number mentioned in the text would otherwise become a new term in the vocabulary.
- Removing numbers and punctuation: in this variation the only remaining characters are letters, so the vocabulary is much cleaner, made up only of words. However, it is possible that this also loses some of the meaning conveyed by the numbers and punctuation that were removed.
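As a rough illustration, these cleaning variants can be implemented with a few regular expressions. This is a minimal sketch; the function and mode names are illustrative, not the exact code used in the experiments.

```python
import re

def clean_text(text, mode="none"):
    """Lower-case the text, then optionally strip digits or everything
    except letters. The mode names here mirror the vectorizer labels used
    later (no_nums, only_alpha) but the function itself is hypothetical."""
    text = text.lower()                        # applied in all variants
    if mode == "no_nums":
        text = re.sub(r"\d+", " ", text)       # drop numeric tokens
    elif mode == "only_alpha":
        text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and spaces only
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("The 5000 dogs ran!", "only_alpha"))  # -> "the dogs ran"
```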
Text Tokenization
The variations of text cleaning described above were all used with the BERT tokenization method described below. Other tokenization variants were also used, including the simple word-based approach built into the sklearn vectorizer, and the character-based analyzer option also built into the sklearn vectorizer. These variants are sketched in code after the list below.
- BERT-based method: a custom tokenization procedure used to prepare input for the BERT classifiers. BERT uses WordPiece, a subword tokenization method similar in spirit to byte-pair encoding. The steps followed in this procedure are:
- Word tokenize the input (e.g. “susie is calling” -> [“susie”, “is”, “calling”])
- Break words into WordPieces (e.g. “calling” -> [“call”, “##ing”])
- Add the special “[CLS]” and “[SEP]” tokens
- Append “index” and “segment” tokens to each input
- Word tokenization: basic tokenization based on the whitespace between words
- Character-based tokenization: each character is treated as an individual unit, and tokens are made up of character n-grams, as described in the following section
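A minimal sketch of how these variants map onto the sklearn vectorizer follows. The `my_wordpiece_tokenize` callable is only a placeholder for the custom BERT-style tokenization step, not a real function defined here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["susie is calling", "the dog ran up the hill"]

# Built-in word analyzer: tokens are word-level units.
word_vec = TfidfVectorizer(analyzer="word")

# Built-in character analyzer: tokens are character n-grams.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))

# Custom tokenization: any callable can be plugged in as the tokenizer,
# e.g. a WordPiece-style tokenizer (placeholder name below).
# cust_vec = TfidfVectorizer(tokenizer=my_wordpiece_tokenize, lowercase=True)

print(word_vec.fit(texts).get_feature_names_out())
print(char_vec.fit(texts).get_feature_names_out()[:10])
```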
N-grams
N-grams determine how many adjacent tokens are grouped into a single unit to count in the document. For example, the character-based 3-grams for “good morning” are “goo”, “ood”, “od ”, “d m”, “ mo”, “mor”, and so on, one for each window of 3 adjacent characters. For sequences of words, the 3-grams of the phrase “the dog ran up the hill” are “the dog ran”, “dog ran up”, “ran up the”, and so on. Any value of n can be chosen. The values chosen for these experiments were:
- BERT-based tokens: unigrams only; unigrams, bigrams, and trigrams
- word tokens: unigrams only; unigrams, bigrams, and trigrams
- character-based tokens: n-grams from three to ten characters
Note that increasing the n-gram range greatly increases the size of the vocabulary, so methods are needed to mitigate this growth, such as setting a maximum vocabulary size or excluding terms that are too rare (most terms will be rare due to Zipf’s law).
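In the sklearn vectorizer these choices map directly onto the ngram_range setting, with a vocabulary cap such as max_features to contain the growth. A minimal sketch follows; the 50,000-term cap is an arbitrary illustrative value, not a setting reported for the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog ran up the hill", "good morning", "susie is calling"]

# Word unigrams only vs. word unigrams + bigrams + trigrams.
uni_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
tri_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))

# Character n-grams from three to ten characters; max_features caps the
# vocabulary so the n-gram explosion stays manageable (illustrative value).
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 10),
                           max_features=50000)

for name, vec in [("1gram", uni_vec), ("3gram", tri_vec), ("char", char_vec)]:
    vec.fit(docs)
    print(name, len(vec.vocabulary_))  # vocabulary size grows quickly with n
```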
Text Vectorizers
There are multiple vectorizers available in sklearn, including the two below (a short sketch follows the list):
- Count vectors: each text is translated into a vector directly based on the counts of each term in the vocabulary. For example, with a vocabulary vector of [dog, runs, the, up, hill, good], the text “The dog runs up the hill. The dog is good.” would be transformed into the vector: [dog:2, runs:1, the:3, up:1, hill:1, good:1].
- term-frequency inverse-document-frequency (tf-idf): the product of two statistics, term frequency and inverse document frequency, which has the effect of weighting terms by importance. Note how the word “the” had a high weight in the previous example, even though it contributes little to the meaning of the content. Tf-idf accounts for this: because “the” occurs at a high frequency across all documents, its weight is scaled down proportionally.
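A minimal sketch of the difference, pinning the vocabulary to the six terms from the example above. Fixing the vocabulary is only for illustration; in the experiments it is learned from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

doc = ["The dog runs up the hill. The dog is good."]
vocab = ["dog", "runs", "the", "up", "hill", "good"]  # fixed, as in the example

count_vec = CountVectorizer(vocabulary=vocab)
print(count_vec.transform(doc).toarray())      # [[2 1 3 1 1 1]]: raw counts

tfidf_vec = TfidfVectorizer(vocabulary=vocab)
print(tfidf_vec.fit_transform(doc).toarray())  # tf-idf weights instead of raw counts
```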
Additional settings used in the vectorizer include the following (a sketch with these settings follows the list):
- minimum frequency: the minimum document frequency of words or n-grams. As mentioned, a large number of tokens and n-grams occur only once or a few times across the whole corpus. This greatly expands the size of the vocabulary, especially when considering a large n-gram range, so a way to limit it is to set a restriction in the vectorizer that only admits terms appearing a minimum number of times into the vocabulary. For this set of experiments this value was set to 10.
- maximum frequency: a way of limiting the vocabulary to reduce the number of stop words. Instead of using a built-in stop-word list, we can simply set a restriction on the maximum allowable document frequency of terms. This means that terms appearing in nearly every document are excluded; such terms do not have a big impact on the classifier anyway, especially given the tf-idf weighting, and their weights would likely just contribute noise and increase the dimensionality unnecessarily.
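A minimal sketch of these settings on the sklearn vectorizer. The min_df=10 value matches the experiments; the max_df=0.9 threshold is an illustrative value, not one reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=10 drops any term seen in fewer than 10 documents (the value used
# in these experiments); max_df=0.9 drops terms appearing in more than 90%
# of documents, acting as a corpus-driven stop-word filter (illustrative).
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=10, max_df=0.9)
```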
Analysis by Classifier and Representations
The results from testing all the representations previously described are summarized in the following tables. The vectorizers are named according to the n-gram range, whether they were character based (“char”), and the type of custom tokenization (the BERT method) and pre-processing (all characters included, numbers removed, or only letter characters included).
Two linear classifiers were tested, logistic regression and a support vector machine (SVM), along with an ensemble method for comparison. Balanced class weights were used, and an unbalanced variant of logistic regression was also included. A sketch of the evaluation setup follows.
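This is a minimal sketch only: the specific estimator classes (LogisticRegressionCV, LinearSVC, RandomForestClassifier), their hyperparameters, and the macro-F1 scoring are assumptions for illustration rather than the exact configuration used, and the toy corpus stands in for the real datasets.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy corpus so the sketch runs end to end; swap in the real texts and labels.
texts = ["the dog runs up the hill"] * 12 + ["susie is calling her friend"] * 12
labels = [0] * 12 + [1] * 12

classifiers = {
    "logregcv": LogisticRegressionCV(max_iter=1000),
    "logregcv_balanced": LogisticRegressionCV(class_weight="balanced", max_iter=1000),
    "svm_balanced": LinearSVC(class_weight="balanced"),
    "random_forest_balanced": RandomForestClassifier(class_weight="balanced"),
}

for name, clf in classifiers.items():
    # min_df is relaxed here for the toy corpus; the experiments used min_df=10.
    pipe = make_pipeline(TfidfVectorizer(min_df=1, ngram_range=(1, 3)), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1_macro")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```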
classifier | vectorizer | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|---
logregcv | 1gram | 21.0 | 0.387130 | 0.291712 | 0.000000 | 0.216216 | 0.341463 | 0.714286 | 0.835821
logregcv | 3gram | 21.0 | 0.404456 | 0.291235 | 0.000000 | 0.117647 | 0.418605 | 0.690909 | 0.823529
logregcv | char | 21.0 | 0.415545 | 0.289436 | 0.000000 | 0.227273 | 0.419753 | 0.690909 | 0.869565
logregcv | cust_all-1gram | 21.0 | 0.376788 | 0.316743 | 0.000000 | 0.050000 | 0.341463 | 0.742515 | 0.865672
logregcv | cust_all-3gram | 21.0 | 0.379196 | 0.301674 | 0.000000 | 0.121212 | 0.305085 | 0.750000 | 0.882353
logregcv | cust_no_nums-1gram | 21.0 | 0.375641 | 0.316426 | 0.000000 | 0.050000 | 0.341463 | 0.750751 | 0.865672
logregcv | cust_no_nums-3gram | 21.0 | 0.393199 | 0.302940 | 0.000000 | 0.121212 | 0.392157 | 0.750000 | 0.882353
logregcv | cust_only_alpha-1gram | 21.0 | 0.377285 | 0.308369 | 0.000000 | 0.171429 | 0.333333 | 0.750000 | 0.852941
logregcv | cust_only_alpha-3gram | 21.0 | 0.401743 | 0.295126 | 0.000000 | 0.121212 | 0.435294 | 0.727273 | 0.848485
logregcv_balanced | 1gram | 21.0 | 0.495358 | 0.210163 | 0.062500 | 0.380952 | 0.488889 | 0.695652 | 0.821918
logregcv_balanced | 3gram | 21.0 | 0.508967 | 0.212936 | 0.074074 | 0.380952 | 0.483871 | 0.718750 | 0.810811
logregcv_balanced | char | 21.0 | 0.521330 | 0.182913 | 0.190476 | 0.379310 | 0.502415 | 0.666667 | 0.857143
logregcv_balanced | cust_all-1gram | 21.0 | 0.491233 | 0.214054 | 0.064516 | 0.352941 | 0.497653 | 0.707692 | 0.833333
logregcv_balanced | cust_all-3gram | 21.0 | 0.499426 | 0.209599 | 0.083333 | 0.333333 | 0.481132 | 0.709677 | 0.837838
logregcv_balanced | cust_no_nums-1gram | 21.0 | 0.497385 | 0.212544 | 0.066667 | 0.366197 | 0.461538 | 0.730159 | 0.861111
logregcv_balanced | cust_no_nums-3gram | 21.0 | 0.499315 | 0.208580 | 0.086957 | 0.347826 | 0.479592 | 0.700000 | 0.873239
logregcv_balanced | cust_only_alpha-1gram | 21.0 | 0.492733 | 0.217237 | 0.076923 | 0.322581 | 0.478873 | 0.730159 | 0.849315
logregcv_balanced | cust_only_alpha-3gram | 21.0 | 0.510576 | 0.208496 | 0.074074 | 0.416667 | 0.480392 | 0.730159 | 0.837838
random_forest_balanced | 1gram | 21.0 | 0.311567 | 0.293623 | 0.000000 | 0.090909 | 0.187500 | 0.701754 | 0.779661
random_forest_balanced | 3gram | 21.0 | 0.340907 | 0.266239 | 0.000000 | 0.146341 | 0.227273 | 0.666667 | 0.779661
random_forest_balanced | char | 21.0 | 0.365583 | 0.263516 | 0.000000 | 0.181818 | 0.251852 | 0.689655 | 0.800000
random_forest_balanced | cust_all-1gram | 21.0 | 0.312634 | 0.283756 | 0.000000 | 0.125000 | 0.174603 | 0.689655 | 0.779661
random_forest_balanced | cust_all-3gram | 21.0 | 0.331651 | 0.272659 | 0.000000 | 0.111111 | 0.229008 | 0.654545 | 0.779661
random_forest_balanced | cust_no_nums-1gram | 21.0 | 0.302543 | 0.291940 | 0.000000 | 0.097561 | 0.160000 | 0.701754 | 0.779661
random_forest_balanced | cust_no_nums-3gram | 21.0 | 0.332594 | 0.269699 | 0.000000 | 0.115702 | 0.227273 | 0.654545 | 0.779661
random_forest_balanced | cust_only_alpha-1gram | 21.0 | 0.311563 | 0.293241 | 0.000000 | 0.090909 | 0.181818 | 0.701754 | 0.779661
random_forest_balanced | cust_only_alpha-3gram | 21.0 | 0.331593 | 0.282147 | 0.000000 | 0.121212 | 0.227273 | 0.666667 | 0.779661
svm_balanced | 1gram | 21.0 | 0.528486 | 0.191674 | 0.000000 | 0.454545 | 0.522727 | 0.702703 | 0.842105
svm_balanced | 3gram | 21.0 | 0.527106 | 0.190601 | 0.062500 | 0.428571 | 0.500000 | 0.687500 | 0.853333
svm_balanced | char | 21.0 | 0.537184 | 0.193380 | 0.000000 | 0.444444 | 0.523364 | 0.718750 | 0.833333
svm_balanced | cust_all-1gram | 21.0 | 0.517229 | 0.186578 | 0.060606 | 0.413793 | 0.526718 | 0.649351 | 0.820513
svm_balanced | cust_all-3gram | 21.0 | 0.527482 | 0.183058 | 0.080000 | 0.444444 | 0.513274 | 0.696970 | 0.794872
svm_balanced | cust_no_nums-1gram | 21.0 | 0.517215 | 0.184749 | 0.064516 | 0.413793 | 0.514286 | 0.649351 | 0.810127
svm_balanced | cust_no_nums-3gram | 21.0 | 0.521650 | 0.188471 | 0.080000 | 0.454545 | 0.500000 | 0.686567 | 0.810127
svm_balanced | cust_only_alpha-1gram | 21.0 | 0.523520 | 0.181054 | 0.066667 | 0.423529 | 0.529412 | 0.657895 | 0.810127
svm_balanced | cust_only_alpha-3gram | 21.0 | 0.527546 | 0.190969 | 0.071429 | 0.437500 | 0.512821 | 0.707692 | 0.810127
Analysis of Classifiers on Full Datasets
Sentence Based
classifier | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
logregcv | 27.0 | 0.538240 | 0.026181 | 0.502447 | 0.521963 | 0.528771 | 0.563489 | 0.583333
logregcv_balanced | 27.0 | 0.560509 | 0.032042 | 0.504496 | 0.532829 | 0.565003 | 0.591144 | 0.606166
random_forest_balanced | 27.0 | 0.506861 | 0.008723 | 0.489726 | 0.502209 | 0.505082 | 0.509686 | 0.532418
svm_balanced | 27.0 | 0.552539 | 0.026245 | 0.511447 | 0.536557 | 0.550852 | 0.573312 | 0.607453
Excerpt Based
classifier | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
logregcv | 27.0 | 0.606388 | 0.029977 | 0.548485 | 0.592954 | 0.600801 | 0.624813 | 0.658854
logregcv_balanced | 27.0 | 0.616874 | 0.023957 | 0.576471 | 0.600716 | 0.612943 | 0.629839 | 0.661818
random_forest_balanced | 27.0 | 0.470446 | 0.021622 | 0.440141 | 0.454623 | 0.464883 | 0.489271 | 0.507993
svm_balanced | 27.0 | 0.614373 | 0.028834 | 0.566914 | 0.590567 | 0.618690 | 0.629487 | 0.670190
Analysis of Representations on Full Datasets
Sentence Based
vectorizer | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
1gram | 12.0 | 0.523218 | 0.014292 | 0.502758 | 0.512716 | 0.524475 | 0.531200 | 0.546624
3gram | 12.0 | 0.551585 | 0.032038 | 0.504334 | 0.528432 | 0.556689 | 0.574398 | 0.593997
char | 12.0 | 0.570441 | 0.031584 | 0.518699 | 0.548608 | 0.579456 | 0.590890 | 0.607453
cust_all-1gram | 12.0 | 0.517531 | 0.016412 | 0.496622 | 0.502655 | 0.517321 | 0.532072 | 0.542048
cust_all-3gram | 12.0 | 0.548571 | 0.034563 | 0.498834 | 0.518188 | 0.558592 | 0.578283 | 0.590374
cust_no_nums-1gram | 12.0 | 0.519189 | 0.017042 | 0.489726 | 0.505752 | 0.520005 | 0.534048 | 0.540292
cust_no_nums-3gram | 12.0 | 0.550599 | 0.033735 | 0.505082 | 0.518338 | 0.557842 | 0.578534 | 0.593817
cust_only_alpha-1gram | 12.0 | 0.518241 | 0.013186 | 0.501186 | 0.507472 | 0.517567 | 0.526971 | 0.537879
cust_only_alpha-3gram | 12.0 | 0.556459 | 0.034341 | 0.506245 | 0.526145 | 0.568348 | 0.578001 | 0.602856
Excerpt Based
vectorizer | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
1gram | 12.0 | 0.556811 | 0.060908 | 0.440141 | 0.540708 | 0.577642 | 0.601289 | 0.606838
3gram | 12.0 | 0.587291 | 0.075564 | 0.451049 | 0.563028 | 0.620402 | 0.637051 | 0.658854
char | 12.0 | 0.600079 | 0.070327 | 0.469178 | 0.579143 | 0.620645 | 0.648738 | 0.670190
cust_all-1gram | 12.0 | 0.560850 | 0.062405 | 0.440141 | 0.547690 | 0.587992 | 0.597148 | 0.623542
cust_all-3gram | 12.0 | 0.589742 | 0.072032 | 0.457539 | 0.576271 | 0.618028 | 0.633470 | 0.653504
cust_no_nums-1gram | 12.0 | 0.555949 | 0.060072 | 0.445614 | 0.536471 | 0.584496 | 0.596083 | 0.604692
cust_no_nums-3gram | 12.0 | 0.589490 | 0.070813 | 0.457539 | 0.577209 | 0.620488 | 0.632463 | 0.648438
cust_only_alpha-1gram | 12.0 | 0.557290 | 0.060581 | 0.440141 | 0.535464 | 0.579711 | 0.595053 | 0.622642
cust_only_alpha-3gram | 12.0 | 0.595678 | 0.071185 | 0.459413 | 0.578810 | 0.623086 | 0.639356 | 0.661433
Summary of Performance Results
The most notable effect of the various representations on performance was that the character-based representation consistently performed better than the other methods. However, the difference between representation methods was almost negligible, considering the amount of variation between folds in the 3-fold cross-validation.
In the comparison of classifiers, it is clear that the linear methods perform best overall, with the balanced SVM slightly better in most cases.