Text Visualizations

This file displays visualizations of the text based on the labelled categories, which appear as circles on the intertopic distance plot. The plot also shows the word distribution associated with each category: the chart on the right lists the most common words in a category when lambda = 1, and the words most specific to that category when lambda = 0, as ranked by the relevance metric.
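Assuming the plot uses the standard relevance metric of Sievert and Shirley (2014), the ranking interpolates between within-topic probability and lift. A minimal sketch (the array names and toy values are illustrative):

```python
import numpy as np

def relevance(p_w_given_t, p_w, lam):
    # lam = 1 ranks terms by p(w|t), i.e. the most common words in the topic;
    # lam = 0 ranks terms by the lift p(w|t)/p(w), i.e. the most specific words.
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

# Toy example: the third term is rare overall but prominent in the topic,
# so it ranks first as lambda moves toward 0.
p_w_given_t = np.array([0.5, 0.3, 0.2])   # p(word | topic)
p_w = np.array([0.40, 0.35, 0.02])        # p(word) over the whole corpus
print(np.argsort(relevance(p_w_given_t, p_w, lam=0.0))[::-1])  # [2 0 1]
```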

The categories are labelled on the plot as numbers; the corresponding label titles and word counts are:

Topic 1: ACCOUNT, number of words: 176621
Topic 2: POLICY, number of words: 91678
Topic 3: EVENT, number of words: 98468
Topic 4: VICTIMS, number of words: 56061
Topic 5: PERPETRATOR, number of words: 77495
Topic 6: MOURNING, number of words: 48223
Topic 7: TRAUMA, number of words: 57280
Topic 8: GRIEF, number of words: 40600
Topic 9: PHOTO, number of words: 17944
Topic 10: INVESTIGATION, number of words: 20004
Topic 11: SOCIALSUPPORT, number of words: 21636
Topic 12: JOURNEY, number of words: 16369
Topic 13: MEDIA, number of words: 12786
Topic 14: RESOURCES, number of words: 12288
Topic 15: SAFETY, number of words: 11732
Topic 16: THREAT, number of words: 9303
Topic 17: MISCELLANEOUS, number of words: 4665
Topic 18: HERO, number of words: 1511
Topic 19: RACECULTURE, number of words: 1423
Topic 20: LEGAL, number of words: 1737

The size of each circle corresponds to the size of its category. Hovering over a word in the chart on the right resizes the circles in proportion to that word's count in each category. Clicking on a topic displays that topic's word distribution, and clicking on an empty part of the distance plot restores the overall word distribution across all documents.
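These interactions match a pyLDAvis-style view; assuming that is (or resembles) the tool behind the plot, here is a minimal sketch of generating such a visualization, with toy data standing in for the real labelled-category statistics:

```python
import numpy as np
import pyLDAvis

rng = np.random.default_rng(0)
n_topics, n_docs, n_vocab = 20, 100, 500

# Toy stand-ins: each row of topic_term_dists is one labelled category's
# distribution over the vocabulary; doc_topic_dists gives each document's
# category proportions.
topic_term_dists = rng.dirichlet(np.ones(n_vocab), size=n_topics)
doc_topic_dists = rng.dirichlet(np.ones(n_topics), size=n_docs)
doc_lengths = rng.integers(50, 500, size=n_docs)
vocab = [f"word{i}" for i in range(n_vocab)]
# Expected corpus-wide count of each term under the toy model.
term_frequency = doc_lengths @ doc_topic_dists @ topic_term_dists

vis = pyLDAvis.prepare(topic_term_dists, doc_topic_dists,
                       doc_lengths, vocab, term_frequency)
pyLDAvis.save_html(vis, "text_visualizations.html")
```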

These results can be used to build an intuition for what the labels have captured in the text. The degree to which the circles overlap gives an idea of how similar the topics are, and the word rankings give insight into which words best characterize each topic. Here, some of the labelled categories sit very close together and overlap, while some of the smaller ones are more distinct.
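That overlap impression can be quantified: intertopic distance plots of this kind typically compute a divergence between the topics' word distributions and project it to two dimensions (pyLDAvis, for example, scales a Jensen-Shannon-based distance matrix). A sketch of that idea, with toy distributions:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

# Toy stand-in for the 20 labelled categories' word distributions.
topic_term_dists = np.random.default_rng(0).dirichlet(np.ones(500), size=20)

# Pairwise Jensen-Shannon distance between every pair of categories.
n = len(topic_term_dists)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = jensenshannon(topic_term_dists[i],
                                                topic_term_dists[j])

# Project to 2-D so that overlapping circles mean similar word distributions.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords[:3])  # x/y positions of the first three category circles
```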

This plot can also be used to visually inspect the effects of the pre-processing. It shows that some improvements could be made, including using n-grams, for example grouping the words “santa” and “barbara” into the single token “santa barbara”, and using lemmatization instead of stemming during tokenization. It can also surface additional stop words that could be removed.
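A sketch of those two improvements, using gensim's phrase detection for the bigrams and NLTK's WordNet lemmatizer in place of a stemmer (the toy documents and thresholds are illustrative):

```python
import nltk
from gensim.models.phrases import Phrases, Phraser
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

docs = [["santa", "barbara", "students", "were", "mourning", "victims"],
        ["the", "santa", "barbara", "community", "honoured", "victims"]]

# Learn frequently co-occurring word pairs, so that "santa" followed by
# "barbara" is merged into the single token "santa_barbara".
bigram = Phraser(Phrases(docs, min_count=1, threshold=1))
docs = [bigram[doc] for doc in docs]

# Lemmatize instead of stemming; without a POS tag the lemmatizer treats
# tokens as nouns, e.g. "victims" -> "victim", "students" -> "student".
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(tok) for tok in doc] for doc in docs]
print(docs[0])
```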