Classification with Flair
This post describes the flair library and the approaches used with it. The classification task in these experiments is the original task of binary classification of accountability, applied here at the sentence level (instead of the excerpt level). See previous posts for an explanation of the dataset and task.
Flair Library
Flair is an open-source library for natural language processing created by Zalando Research, described by its authors as "a very simple framework for state-of-the-art natural language processing (NLP)". The main benefit of this library is that it makes it easy to implement and experiment with many state-of-the-art methods within the same framework, in particular experiments with different text representations (embeddings).
For additional resources explaining flair, see the tutorials in the flair GitHub repository.
Testing Flair Word Embeddings and Document Embeddings
Simple experiments are implemented in the [notebook on GitHub](https://github.com/anjapago/AnalyzeAccountability/blob/master/flair_analysis.ipynb).
The first simple approach tested was document pooling with standard GloVe embeddings. I also tested the recommended setting of GloVe embeddings stacked with the flair forward and backward embeddings, and experimented with the document RNN and LSTM embeddings. None of the results showed a noticeable improvement in F-score.
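For reference, a classifier along these lines can be set up as in the sketch below. This is a minimal sketch assuming a flair 0.4-style API; `corpus` is a flair `Corpus` of labeled sentences loaded elsewhere, and the output path is a placeholder:

```python
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# GloVe word embeddings, mean-pooled into a single document vector;
# the stacked glove + flair forward/backward setting only requires
# extending the embeddings list passed to DocumentPoolEmbeddings
document_embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')])

# 'corpus' is assumed to be a flair Corpus of labeled sentences
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/pooled-glove', max_epochs=100)
```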
Optimization Experiment
Hyperopt is used to optimize various parameters of the flair classifiers, including the choice of word embeddings. The search space of parameters is defined as follows:
```python
from hyperopt import hp
from flair.embeddings import (WordEmbeddings, FlairEmbeddings, StackedEmbeddings,
                              BytePairEmbeddings, OneHotEmbeddings)
from flair.hyperparameter.param_selection import SearchSpace, Parameter

# 'corpus' is the flair Corpus of labeled sentences (needed by OneHotEmbeddings)
opt_embeddings = [[WordEmbeddings('glove')],
                  [WordEmbeddings('en-news')],
                  [BytePairEmbeddings('en')],
                  [OneHotEmbeddings(corpus)],
                  [StackedEmbeddings([WordEmbeddings('glove'),
                                      FlairEmbeddings('news-forward'),
                                      FlairEmbeddings('news-backward')])]
                  # [ELMoEmbeddings()]
                  ]

search_space = SearchSpace()
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=opt_embeddings)
search_space.add(Parameter.HIDDEN_SIZE, hp.choice, options=[32, 64, 128])
search_space.add(Parameter.RNN_LAYERS, hp.choice, options=[1, 2])
search_space.add(Parameter.DROPOUT, hp.uniform, low=0.0, high=0.5)
search_space.add(Parameter.LEARNING_RATE, hp.choice, options=[0.05, 0.1, 0.15, 0.2])
search_space.add(Parameter.MINI_BATCH_SIZE, hp.choice, options=[8, 16, 32])
```
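With the search space defined, the optimization can be launched through flair's parameter selection wrapper around hyperopt. The sketch below is illustrative: the base path, `training_runs`, and `max_evals` values are assumptions rather than the exact settings used here, while `max_epochs=100` matches the runs reported below.

```python
from flair.hyperparameter.param_selection import (TextClassifierParamSelector,
                                                  OptimizationValue)

# each hyperopt evaluation trains a document-RNN classifier; running
# multiple training runs per evaluation produces the mean score and
# variance reported in the results below
param_selector = TextClassifierParamSelector(
    corpus,
    multi_label=False,
    base_path='resources/hyperopt',   # output directory (placeholder)
    document_embedding_type='lstm',
    max_epochs=100,
    training_runs=3,
    optimization_value=OptimizationValue.DEV_SCORE)

param_selector.optimize(search_space, max_evals=7)
```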
The full code can be found in the GitHub repository.
The following are the results from each evaluation run. Each run trained for 100 epochs and took around 12 hours to complete; with 7 runs, the code took around 3 days to produce these results, running on 16 cores. The results show the setting of each parameter and the performance on the test set ("test_score"), where the test score is the F-score for the accountability label. The results are similar to what we have seen from previous methods, including the simple sklearn vectorizers and classifiers as well as BERT, all of which achieved results in the same ballpark of around F-score = 0.6 for sentence-based classification on the accountability label.
| run | embeddings | dropout | hidden_size | learning_rate | mini_batch_size | rnn_layers | score | variance | test_score |
|-----|------------|---------|-------------|---------------|-----------------|------------|-------|----------|------------|
| 1 | en-fasttext-news-300d-1M | 0.0057 | 64 | 0.2 | 32 | 2 | 0.3419 | 8.76e-07 | 0.6134 |
| 2 | glove | 0.4165 | 128 | 0.15 | 16 | 2 | 0.3706 | 1.70e-06 | 0.5806 |
| 3 | glove | 0.0684 | 128 | 0.05 | 8 | 1 | 0.3612 | 1.37e-07 | 0.6006 |
| 4 | en-fasttext-news-300d-1M | 0.4348 | 64 | 0.1 | 32 | 1 | 0.4130 | 1.57e-05 | 0.4822 |
| 5 | bpe-en-100000-50 | 0.2085 | 32 | 0.1 | 32 | 2 | 0.4092 | 1.59e-07 | 0.5348 |
| 6 | en-fasttext-news-300d-1M | 0.3191 | 128 | 0.15 | 8 | 2 | 0.3539 | 2.46e-07 | 0.6359 |
| 7 | one-hot | 0.2030 | 32 | 0.1 | 8 | 2 | 0.3832 | 4.33e-07 | 0.6041 |
To produce these results, the code was run on an HPC cluster, using a Singularity container. The Singularity recipe can be found in the GitHub repository.