This post gives a quick recap of what was accomplished over the summer working on this project with RedHen and Dr. Glik’s research team at UCLA. The main accomplishments are analyses of the data from various angles, as well as performance results from several classification approaches.

Data Visualizations and Analysis

The first objective was to provide the researchers with some insights into the data and what they have captured with their annotations. The visualizations and analysis include:

  1. visualizations of the text content of each topic
  2. analysis of the labels used for classification (class balances; a small sketch follows this list)
  3. analysis of the lengths of labelled texts, to assess the transition from excerpt-based to sentence-based labels
  4. demonstration of how the choice of text representation affects the meaning being captured, to provide insight to the researchers
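To illustrate the label analysis, a class balance check can be as simple as counting label values per topic. The following is a minimal sketch, assuming the annotations are loaded into a pandas DataFrame with one binary column per topic label (the file name and column names are placeholders):

```python
import pandas as pd

# Hypothetical structure: one row per labelled excerpt, one binary column per topic.
df = pd.read_csv("annotations.csv")

for topic in ["accountability", "race/culture"]:  # illustrative topic columns
    counts = df[topic].value_counts(normalize=True)
    print(f"{topic}: {counts.to_dict()}")
```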

These four topics were discussed in several posts, listed in the following sections.

Text Visualizations

Label Analysis

Excerpts to Sentences Analysis

Interpreting Text Representation

Classification Results

Several variations of classification were tried. Sklearn classifiers were tested in a large experiment, with each classifier paired with 9 different text representation methods. A pre-trained model from the state-of-the-art BERT method was also fine-tuned for this classification task. Finally, a large optimization experiment was run using the Flair library to test various word and document embeddings. Classifiers were tested primarily on the accountability label, but each of the other main labels was also tested individually. The classifiers were trained on the original excerpts as well as on labelled sentences. Finally, the results were assessed using all 7 datasets together, as well as each dataset alone. Many posts summarize these results; they are listed in the following sections.
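To make the sklearn experiment concrete, the sketch below shows the general shape of that grid: pairing vectorizers with classifiers in a pipeline and cross-validating each combination. The specific vectorizers, classifiers, and parameters are illustrative rather than the exact configuration used; `texts` and `labels` stand in for the labelled excerpts (or sentences) and their binary labels for one topic.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# texts: list of labelled excerpt/sentence strings; labels: 0/1 labels for one topic
vectorizers = {
    "counts": CountVectorizer(ngram_range=(1, 2), min_df=2),
    "tfidf": TfidfVectorizer(ngram_range=(1, 2), min_df=2),
}
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svc": LinearSVC(),
}

for vec_name, vec in vectorizers.items():
    for clf_name, clf in classifiers.items():
        pipeline = Pipeline([("vec", vec), ("clf", clf)])
        scores = cross_val_score(pipeline, texts, labels, scoring="f1", cv=5)
        print(f"{vec_name} + {clf_name}: f1 = {scores.mean():.3f}")
```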

The main findings were that advanced methods such as BERT and Flair were not able to outperform the baseline sklearn classification methods. The overall F-scores, while not very high, were comparable to the level of human inter-annotator agreement.

The overall results showed that the excerpt-based form of the data typically outperformed the sentence-based form, so further investigation of grouped-sentence methods, such as a sliding window approach, should be considered. The most likely reason is that the meaning captured by these annotations is not preserved at the single-sentence level, or that breaking up the excerpts introduces more noise into the labels.

Another main conclusion is that classification performance varies dramatically across events for some topics, and one of the main challenges of this task is that the given labels may not generalize well across different events.

Sklearn Classifiers and Vectorizers

Bert Classifiers

Flair Classifier

Individual Event Comparisons

Suggestions for Future Work

The three main suggestions for future investigation are: more testing of the length of the unit used for classification (e.g. sentence vs excerpt), testing more text representations, and modifying the training set to remove examples that hurt performance.

Unit Used for Classification (Sentence vs Excerpt)

One objective of this research task was to assess the prospect of automating the identification of these labelled topics in new, incoming datasets. To identify the topics in unseen articles, the input would be a full article, not a set of isolated excerpts. Using a classifier trained on excerpts would therefore also require identifying excerpts within a longer article, which is itself a very challenging task to automate. The solution used in testing was to convert the excerpts to sentences, since sentences can be identified within an article much more easily and can be used to train a classifier that operates on an article sentence by sentence.
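As a rough sketch of how that conversion can be done (this uses NLTK's sentence tokenizer; the function name is just illustrative), each excerpt is split into sentences and the excerpt's label is propagated to every sentence:

```python
import nltk

nltk.download("punkt", quiet=True)

def excerpts_to_sentences(excerpts, labels):
    """Split each labelled excerpt into sentences, propagating the
    excerpt's label to every sentence it contains."""
    sent_texts, sent_labels = [], []
    for excerpt, label in zip(excerpts, labels):
        for sentence in nltk.sent_tokenize(excerpt):
            sent_texts.append(sentence)
            sent_labels.append(label)
    return sent_texts, sent_labels
```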

This worked reasonably well; however, there was a noticeable drop in performance when transitioning from the excerpt level to the sentence level. This effect could be investigated further by training separate classifiers on varying unit lengths. For example, the excerpts could be broken up into units of 1, 2, 3, and 4 sentences. After comparing performance at each fixed length, the best-performing classifier could be used to identify the topics within articles using a sliding window approach.
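A minimal sketch of that sliding window idea is shown below: an article is split into sentences, overlapping groups of a fixed size are formed, and each group can then be scored by a classifier trained on units of that length. The function and variable names are hypothetical.

```python
import nltk

def sliding_windows(article_text, window_size=3, step=1):
    """Yield overlapping groups of `window_size` consecutive sentences,
    so a classifier trained on fixed-length units can scan a full article."""
    sentences = nltk.sent_tokenize(article_text)
    if not sentences:
        return
    for start in range(0, max(1, len(sentences) - window_size + 1), step):
        yield " ".join(sentences[start:start + window_size])

# Example usage with a trained sklearn pipeline (hypothetical `clf_pipeline`):
# predictions = [clf_pipeline.predict([w])[0] for w in sliding_windows(article_text)]
```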

This experiment is a worthwhile next step, and could also provide insights to other researchers in the social sciences. Excerpts are typically seen as “meaning units”, and excerpt-level annotation schemes are very common in this field of research. For further details, see the articles: Systematic text condensation, A hands-on guide to doing content analysis, and An Introduction to Codes and Coding.

Testing More Text Representations

As mentioned in the Flair section, a systematic experiment was run to optimize the classifier, including the choice of text representation method. This experiment could be made more thorough, for example by testing more combinations of stacked embeddings. Another state-of-the-art library, “pytorch-transformers”, could also be used, with models implemented in PyTorch.
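For reference, a minimal Flair setup with stacked word and contextual embeddings looks roughly like the following. This is based on the Flair API at the time of writing and may need adjustment for newer versions; the data folder, file names, embedding choices, and hyperparameters are placeholders, and the corpus files are assumed to be in FastText label format (e.g. "__label__accountability <text>").

```python
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Load a labelled corpus from FastText-format text files.
corpus = ClassificationCorpus("data_dir",
                              train_file="train.txt",
                              dev_file="dev.txt",
                              test_file="test.txt")

# Stack classic GloVe word embeddings with contextual Flair embeddings,
# pooled into a document embedding by an RNN.
document_embeddings = DocumentRNNEmbeddings(
    [WordEmbeddings("glove"),
     FlairEmbeddings("news-forward"),
     FlairEmbeddings("news-backward")],
    hidden_size=256)

classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary())

trainer = ModelTrainer(classifier, corpus)
trainer.train("output/", max_epochs=10)
```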

Also, to determine why the state-of-the-art deep learning methods are not outperforming the baseline methods, machine learning interpretability methods such as the “What-If Tool” should be applied.

Removing Training Samples Detrimental to Classifiers

As shown in previous posts, some subsets of the data (particular events) perform much worse on particular topics. For example, Isla Vista performed very well on the topic “race/culture”, while Orlando performed quite poorly. Removing the Orlando dataset from training and testing could perhaps boost overall performance for the other events.
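A quick way to test that hypothesis is to retrain with one event held out and compare scores against the full-data result. A minimal sketch, assuming the data is in a pandas DataFrame with hypothetical "text", "label", and "event" columns, and `pipeline` is an sklearn pipeline like the one sketched earlier:

```python
from sklearn.model_selection import cross_val_score

def score_without_event(df, event_to_drop, pipeline):
    """Cross-validate the pipeline with one event's examples excluded."""
    subset = df[df["event"] != event_to_drop]
    scores = cross_val_score(pipeline, subset["text"], subset["label"],
                             scoring="f1", cv=5)
    return scores.mean()

# e.g. compare score_without_event(df, "Orlando", pipeline) with the score on all events
```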

Another way to determine which training examples are detrimental to performance is to detect them automatically using data cleaning methods. One promising method is “Data Shapley”, which applies the principles behind SHAP and Shapley values (from game theory and, more recently, feature importance analysis) to measure the impact of individual training examples on model errors. Then, using a threshold, the most detrimental examples could be excluded from the dataset before training.
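The sketch below is a simplified, from-scratch truncated Monte Carlo approximation in the spirit of Data Shapley, not the original authors' implementation. It estimates a per-example value by measuring how much each training point changes validation accuracy when points are added in random orderings; examples with strongly negative values would be candidates for removal.

```python
import numpy as np
from sklearn.base import clone

def tmc_data_shapley(model, X_train, y_train, X_val, y_val,
                     n_permutations=20, tolerance=1e-3):
    """Rough truncated Monte Carlo estimate of per-example data values.
    Assumes y_train / y_val are numpy arrays of non-negative integer labels."""
    n = len(y_train)
    values = np.zeros(n)
    # Score with the full training set, used to decide when to truncate.
    full_score = clone(model).fit(X_train, y_train).score(X_val, y_val)
    # Baseline: accuracy of always predicting the training majority class.
    baseline = np.mean(y_val == np.bincount(y_train).argmax())

    for _ in range(n_permutations):
        perm = np.random.permutation(n)
        prev_score = baseline
        for i, idx in enumerate(perm, start=1):
            subset = perm[:i]
            if abs(full_score - prev_score) < tolerance:
                score = prev_score  # truncate: adding more points barely helps
            elif len(np.unique(y_train[subset])) < 2:
                score = prev_score  # cannot fit a classifier on a single class
            else:
                score = clone(model).fit(X_train[subset],
                                         y_train[subset]).score(X_val, y_val)
            # Average the marginal contribution of this example over permutations.
            values[idx] += (score - prev_score) / n_permutations
            prev_score = score
    return values
```

Examples whose estimated value falls below a chosen threshold could then be dropped before retraining, mirroring the filtering step described above.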

Conclusions on Experience

This project has been a good learning experience for me, covering all the main areas of data science, including many different types of visualizations and the training and testing of many different types of models. It has been a good opportunity to experiment with many natural language processing techniques I have been interested in working with. I am grateful to RedHen, Google Summer of Code, and Dr. Glik’s research group for providing this project.