This analysis compares different text representations with simple sklearn linear and ensemble classifiers. The basis of all the text representations tested in this analysis is the sklearn vectorizer. Another, more common approach to representing text is based on word embeddings, but that is out of scope for this post and will be covered in a subsequent post.

The basic sklearn vectorizers are the focus of this analysis because, in previous tests, their baseline performance was comparable to state-of-the-art text representations using BERT embeddings. Since no experimentation had been done on the input representation fed to these basic linear classifiers, and their performance was already comparable with state-of-the-art methods, the simple sklearn classifiers are revisited here and tested with different variations of the input text representation, to see whether any substantial improvements can be obtained.

Text Pre-Processing and Vectorization

Several variations of each phase of the text processing pipeline were tested in this experiment, as explained in the following sections. A high-level sketch of how the phases fit together follows the list below.

The main phases are:

  • Text Cleaning
  • Text Tokenization
  • N-grams
  • Text Vectorizer
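
To make the phases concrete, here is a minimal sketch of how they can chain together in an sklearn pipeline. The class choices and parameters are illustrative placeholders, not the exact configuration used in these experiments.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Cleaning, tokenization, and n-gram extraction all happen inside (or just before)
# the vectorizer; the classifier then consumes the resulting sparse vectors.
pipeline = Pipeline([
    ("vect", TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("clf", LogisticRegressionCV(max_iter=1000)),
])
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```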

Text Cleaning

There are numerous variations of what can be done during the text cleaning phase, such as removing certain terms, numbers, or punctuation, lemmatizing words, lower-casing words, etc. The first step applied in all cases is lower-casing all words, to reduce the vocabulary size and noisiness. Several variations of subsequent cleaning steps were then applied (a small sketch follows the list):

  • None: input text remains as is
  • Removing numbers: numeric values often carry little meaning and mostly add noise, e.g. the numbers 1, 5000, 35, etc.; each number mentioned in the text becomes a new term in the vocabulary.
  • Removing numbers and punctuation: in this variation the only characters remaining are letters, so the vocabulary is much cleaner, made up only of words. However, it is possible that some of the meaning carried by the numbers and punctuation is lost.
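
The following is a minimal sketch of these cleaning variants, using a small regex-based helper (clean_text and its mode names are hypothetical, chosen only to mirror the variants above):

```python
import re

def clean_text(text, mode="none"):
    """Apply one of the cleaning variants; lower-casing is always done first."""
    text = text.lower()
    if mode == "no_numbers":
        text = re.sub(r"\d+", " ", text)        # drop numeric tokens such as 1, 5000, 35
    elif mode == "only_alpha":
        text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse the extra whitespace

print(clean_text("Order 5000 units, ASAP!", mode="only_alpha"))  # -> "order units asap"
```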

Text Tokenization

The variations of text cleaning previously explained were all used with the BERT tokenization method described below. Other tokenization variants were also used, including the simplest word-based approach built into the sklearn vectorizer, as well as its character-based analyzer option. A sketch of plugging a custom tokenizer into the vectorizer follows the list.

  • BERT-based method: a custom tokenization procedure, the same one used to prepare input for the BERT classifiers. BERT uses WordPiece, a subword tokenization method that breaks rare words into smaller units (similar in spirit to byte-pair encoding). The steps followed in this procedure are:
    • Word tokenize it (i.e. “susie is calling” -> [“susie”, “is”, “calling”])
    • Break words into WordPieces (i.e. “calling” -> [“call”, “##ing”])
    • Add special “CLS” and “SEP” tokens
    • Append “index” and “segment” tokens to each input
  • Word Tokenization: basic tokenization based on white spaces between words
  • Character based tokenization: each character is considered an individual unit, and tokens are made up of character n-grams, as will be described in the following section
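
As an illustration of how such a tokenizer can be plugged into the sklearn vectorizer, the sketch below uses the Hugging Face transformers WordPiece tokenizer as a stand-in for the custom BERT tokenization step (an assumption; the exact tokenization code used in the experiments is not shown here):

```python
from transformers import BertTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("susie is calling"))  # WordPiece tokens; rare words split into "##"-prefixed pieces

# Hand the WordPiece tokenizer to the vectorizer in place of its default word splitter.
# (The special [CLS]/[SEP] and segment ids matter for BERT itself, not for a bag-of-n-grams vectorizer.)
vectorizer = TfidfVectorizer(
    tokenizer=bert_tokenizer.tokenize,
    lowercase=False,          # the uncased BERT tokenizer already lower-cases
    ngram_range=(1, 1),
)
```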

N-grams

N-grams determine how many adjacent tokens are grouped into a single unit to count in the document. For example, the character-based 3-grams for “good morning” are: “goo”, “ood”, “od ”, “d m”, “ mo”, “mor”, etc., one for each window of 3 adjacent characters. For sequences of words, the 3-grams for the phrase “the dog ran up the hill” are “the dog ran”, “dog ran up”, “ran up the”, etc. Any value of n can be chosen as the n-gram size to consider. The settings chosen for these experiments were:

  • BERT-based tokens: unigrams only, and unigrams + bigrams + trigrams
  • word tokens: unigrams only, and unigrams + bigrams + trigrams
  • character based: n-grams from three to ten characters

Note that increasing the n-gram range greatly increases the size of the vocabulary, so this growth must be mitigated, for example by setting a maximum vocabulary size and by excluding vocabulary terms that are too rare (most terms will be rare, due to Zipf’s law), as shown in the sketch below.
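
A sketch of how these ranges map onto sklearn’s ngram_range parameter (the vocabulary cap shown is an illustrative mitigation, not a value reported in this post):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word-token variants: unigrams only, or unigrams + bigrams + trigrams.
word_unigrams = CountVectorizer(analyzer="word", ngram_range=(1, 1))
word_trigrams = CountVectorizer(analyzer="word", ngram_range=(1, 3))

# Character-based variant: n-grams of three to ten characters.
# A vocabulary cap is one way to keep the feature space from exploding.
char_ngrams = CountVectorizer(
    analyzer="char",
    ngram_range=(3, 10),
    max_features=100_000,  # illustrative cap
)
```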

Text Vectorizers

There are multiple vectorizers available in sklearn, including:

  • Count vectors: each text is translated into a vector directly based on the counts of each term in the vocabulary. For example, with a vocabulary vector of [dog, runs, the, up, hill, good], the text “The dog runs up the hill. The dog is good.” would be transformed into the vector: [dog:2, runs:1, the:3, up:1, hill:1, good:1].
  • term-frequency inverse-document-frequency (tfidf): the product of two statistics, term frequency and inverse document frequency, which has the effect of weighting terms by importance. Note how the word “the” received a high count in the previous example, even though it contributes little to the meaning of the content. Tfidf accounts for this: because “the” occurs at high frequency across all documents, its inverse document frequency is low, which proportionally decreases its weight. Both vectorizers are sketched below.
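
A minimal sketch of the two vectorizers on a small two-document example (the second document is added only so that the inverse document frequency is not trivial):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The dog runs up the hill. The dog is good.",
    "The cat sleeps on the hill.",
]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
print(dict(zip(count_vec.get_feature_names_out(), X_counts.toarray()[0])))
# first document -> raw counts, e.g. "the": 3, "dog": 2, "runs": 1, ...

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)
# "the" and "hill" occur in both documents, so their inverse document frequency
# (and therefore their tfidf weight) is lower than that of document-specific terms.
```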

Additional settings used in the vectorizer include:

  • minimum frequency: the minimum frequency of words or n-grams. As mentioned, a large number of tokens and n-grams occur only once or a few times across the whole corpus. This greatly expands the size of the vocabulary, especially when a wide n-gram range is considered, so one way to limit it is to set a restriction in the vectorizer that only keeps terms appearing a minimum number of times. For this set of experiments the value was set at 10.
  • maximum frequency: a way of limiting the vocabulary to reduce the number of stop words. Instead of using a built-in stop-word list, we can simply restrict the maximum allowable frequency of terms. Terms that appear in (nearly) every document are excluded: they have little impact on the classifier anyway, especially given the tfidf weighting, and their weights would mostly contribute noise and increase the dimensionality unnecessarily. Both settings appear in the sketch below.
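
A sketch of these two settings as sklearn parameters (min_df=10 follows the value stated above; the max_df threshold of 0.95 is an illustrative choice, since the exact value is not given here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),
    min_df=10,     # drop terms that appear in fewer than 10 documents
    max_df=0.95,   # drop terms in more than 95% of documents (corpus-driven stop words)
)
```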

Analysis by Classifier and Representations

The results from testing all the representations previously described are summarized in the following tables. The vectorizers are named according to the n-gram setting (1gram or 3gram), whether they were character based (“char”), and, for the custom BERT-method tokenization (“cust”), the pre-processing applied: all characters kept (“all”), numbers removed (“no_nums”), or only letter characters kept (“only_alpha”).

Two linear classifiers were tested, logistic regression and a support vector machine (SVM), along with an ensemble method (random forest) for comparison. A balanced cost function was used for all of them, and an unbalanced variant of logistic regression was also included.
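
A sketch of a plausible classifier set-up matching the names in the tables below; the exact sklearn classes and hyperparameters are assumptions (e.g. LinearSVC standing in for the SVM):

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "logregcv": LogisticRegressionCV(max_iter=1000),
    "logregcv_balanced": LogisticRegressionCV(class_weight="balanced", max_iter=1000),
    "svm_balanced": LinearSVC(class_weight="balanced"),
    "random_forest_balanced": RandomForestClassifier(class_weight="balanced"),
}
# Each classifier is trained and evaluated on each vectorizer variant.
```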

classifier  vectorizer  count  mean  std  min  25%  50%  75%  max
logregcv 1gram 21.0 0.387130 0.291712 0.000000 0.216216 0.341463 0.714286 0.835821
3gram 21.0 0.404456 0.291235 0.000000 0.117647 0.418605 0.690909 0.823529
char 21.0 0.415545 0.289436 0.000000 0.227273 0.419753 0.690909 0.869565
cust_all-1gram 21.0 0.376788 0.316743 0.000000 0.050000 0.341463 0.742515 0.865672
cust_all-3gram 21.0 0.379196 0.301674 0.000000 0.121212 0.305085 0.750000 0.882353
cust_no_nums-1gram 21.0 0.375641 0.316426 0.000000 0.050000 0.341463 0.750751 0.865672
cust_no_nums-3gram 21.0 0.393199 0.302940 0.000000 0.121212 0.392157 0.750000 0.882353
cust_only_alpha-1gram 21.0 0.377285 0.308369 0.000000 0.171429 0.333333 0.750000 0.852941
cust_only_alpha-3gram 21.0 0.401743 0.295126 0.000000 0.121212 0.435294 0.727273 0.848485
logregcv_balanced 1gram 21.0 0.495358 0.210163 0.062500 0.380952 0.488889 0.695652 0.821918
3gram 21.0 0.508967 0.212936 0.074074 0.380952 0.483871 0.718750 0.810811
char 21.0 0.521330 0.182913 0.190476 0.379310 0.502415 0.666667 0.857143
cust_all-1gram 21.0 0.491233 0.214054 0.064516 0.352941 0.497653 0.707692 0.833333
cust_all-3gram 21.0 0.499426 0.209599 0.083333 0.333333 0.481132 0.709677 0.837838
cust_no_nums-1gram 21.0 0.497385 0.212544 0.066667 0.366197 0.461538 0.730159 0.861111
cust_no_nums-3gram 21.0 0.499315 0.208580 0.086957 0.347826 0.479592 0.700000 0.873239
cust_only_alpha-1gram 21.0 0.492733 0.217237 0.076923 0.322581 0.478873 0.730159 0.849315
cust_only_alpha-3gram 21.0 0.510576 0.208496 0.074074 0.416667 0.480392 0.730159 0.837838
random_forest_balanced 1gram 21.0 0.311567 0.293623 0.000000 0.090909 0.187500 0.701754 0.779661
3gram 21.0 0.340907 0.266239 0.000000 0.146341 0.227273 0.666667 0.779661
char 21.0 0.365583 0.263516 0.000000 0.181818 0.251852 0.689655 0.800000
cust_all-1gram 21.0 0.312634 0.283756 0.000000 0.125000 0.174603 0.689655 0.779661
cust_all-3gram 21.0 0.331651 0.272659 0.000000 0.111111 0.229008 0.654545 0.779661
cust_no_nums-1gram 21.0 0.302543 0.291940 0.000000 0.097561 0.160000 0.701754 0.779661
cust_no_nums-3gram 21.0 0.332594 0.269699 0.000000 0.115702 0.227273 0.654545 0.779661
cust_only_alpha-1gram 21.0 0.311563 0.293241 0.000000 0.090909 0.181818 0.701754 0.779661
cust_only_alpha-3gram 21.0 0.331593 0.282147 0.000000 0.121212 0.227273 0.666667 0.779661
svm_balanced 1gram 21.0 0.528486 0.191674 0.000000 0.454545 0.522727 0.702703 0.842105
3gram 21.0 0.527106 0.190601 0.062500 0.428571 0.500000 0.687500 0.853333
char 21.0 0.537184 0.193380 0.000000 0.444444 0.523364 0.718750 0.833333
cust_all-1gram 21.0 0.517229 0.186578 0.060606 0.413793 0.526718 0.649351 0.820513
cust_all-3gram 21.0 0.527482 0.183058 0.080000 0.444444 0.513274 0.696970 0.794872
cust_no_nums-1gram 21.0 0.517215 0.184749 0.064516 0.413793 0.514286 0.649351 0.810127
cust_no_nums-3gram 21.0 0.521650 0.188471 0.080000 0.454545 0.500000 0.686567 0.810127
cust_only_alpha-1gram 21.0 0.523520 0.181054 0.066667 0.423529 0.529412 0.657895 0.810127
cust_only_alpha-3gram 21.0 0.527546 0.190969 0.071429 0.437500 0.512821 0.707692 0.810127

Analysis of Classifier on Full Datasets

Sentence Based

classifier  count  mean  std  min  25%  50%  75%  max
logregcv 27.0 0.538240 0.026181 0.502447 0.521963 0.528771 0.563489 0.583333
logregcv_balanced 27.0 0.560509 0.032042 0.504496 0.532829 0.565003 0.591144 0.606166
random_forest_balanced 27.0 0.506861 0.008723 0.489726 0.502209 0.505082 0.509686 0.532418
svm_balanced 27.0 0.552539 0.026245 0.511447 0.536557 0.550852 0.573312 0.607453

Excerpt Based

classifier  count  mean  std  min  25%  50%  75%  max
logregcv 27.0 0.606388 0.029977 0.548485 0.592954 0.600801 0.624813 0.658854
logregcv_balanced 27.0 0.616874 0.023957 0.576471 0.600716 0.612943 0.629839 0.661818
random_forest_balanced 27.0 0.470446 0.021622 0.440141 0.454623 0.464883 0.489271 0.507993
svm_balanced 27.0 0.614373 0.028834 0.566914 0.590567 0.618690 0.629487 0.670190

Analysis of Representation on Full datasets

Sentence Based

vectorizer  count  mean  std  min  25%  50%  75%  max
1gram 12.0 0.523218 0.014292 0.502758 0.512716 0.524475 0.531200 0.546624
3gram 12.0 0.551585 0.032038 0.504334 0.528432 0.556689 0.574398 0.593997
char 12.0 0.570441 0.031584 0.518699 0.548608 0.579456 0.590890 0.607453
cust_all-1gram 12.0 0.517531 0.016412 0.496622 0.502655 0.517321 0.532072 0.542048
cust_all-3gram 12.0 0.548571 0.034563 0.498834 0.518188 0.558592 0.578283 0.590374
cust_no_nums-1gram 12.0 0.519189 0.017042 0.489726 0.505752 0.520005 0.534048 0.540292
cust_no_nums-3gram 12.0 0.550599 0.033735 0.505082 0.518338 0.557842 0.578534 0.593817
cust_only_alpha-1gram 12.0 0.518241 0.013186 0.501186 0.507472 0.517567 0.526971 0.537879
cust_only_alpha-3gram 12.0 0.556459 0.034341 0.506245 0.526145 0.568348 0.578001 0.602856

Excerpt Based

vectorizer  count  mean  std  min  25%  50%  75%  max
1gram 12.0 0.556811 0.060908 0.440141 0.540708 0.577642 0.601289 0.606838
3gram 12.0 0.587291 0.075564 0.451049 0.563028 0.620402 0.637051 0.658854
char 12.0 0.600079 0.070327 0.469178 0.579143 0.620645 0.648738 0.670190
cust_all-1gram 12.0 0.560850 0.062405 0.440141 0.547690 0.587992 0.597148 0.623542
cust_all-3gram 12.0 0.589742 0.072032 0.457539 0.576271 0.618028 0.633470 0.653504
cust_no_nums-1gram 12.0 0.555949 0.060072 0.445614 0.536471 0.584496 0.596083 0.604692
cust_no_nums-3gram 12.0 0.589490 0.070813 0.457539 0.577209 0.620488 0.632463 0.648438
cust_only_alpha-1gram 12.0 0.557290 0.060581 0.440141 0.535464 0.579711 0.595053 0.622642
cust_only_alpha-3gram 12.0 0.595678 0.071185 0.459413 0.578810 0.623086 0.639356 0.661433

Summary of Performance Results

The most notable effect of the various representations on performance was that the character-based representation consistently performed best. However, the differences between representation methods were almost negligible, considering the amount of variation between folds in the 3-fold cross-validation.

Comparing classifiers, the linear methods clearly perform best overall, with the balanced SVM slightly ahead in most cases.