Previously:
If you’ve been following along with my previous posts, you know that I’ve been working on prepping data from the television show Frasier. That involved a few different ways of collecting the data, including cleaning it in BASH and scraping and augmenting data in R.
Once I had the data ready, I was able to dive in. In the first part of my analysis, I used the subtools package in R along with data from IMDB.com to create a time-series sentiment analysis. Originally I thought I could do this with tidytext, but I found that using only unigrams with that package didn’t capture the full picture of each sentence. The sentimentr package, which scores whole sentences rather than individual words, proved to be a better fit for what I was going for; you can check it out here.
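To show why that mattered, here’s a minimal sketch of sentence-level scoring with sentimentr; the example lines are invented stand-ins, not pulled from my transcript data:

```r
library(sentimentr)

lines <- c("I am not a fussy man.",
           "You were wrong, and I could not be happier about it.")

# sentiment_by() scores whole sentences, so valence shifters like "not"
# flip or dampen the polarity -- something a unigram lookup misses.
sentiment_by(get_sentences(lines))
```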
In the second analysis I tackled more detailed information from the transcripts at kacl780.net, augmenting them with data from IMDB and the gender package in R. There’s more information available than I actually have time for at this point, so I’m hoping I can revisit it at some point in the future.
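For a rough idea of how the gender package works, here’s a minimal sketch; the names are just the main cast, and the year range is my own guess at plausible birth years:

```r
library(gender)  # the "ssa" method also requires the genderdata package

# Infer likely gender from first names using U.S. Social Security
# baby-name records; `years` bounds the birth years considered.
gender(c("Daphne", "Martin", "Roz", "Niles"),
       years = c(1930, 1975), method = "ssa")
```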
I’m also hoping I can combine some of the info that I didn’t use into a Shiny app that presents it in a nice tidy format.
If you remember from my very first post, my goal was really just to duplicate, with Frasier data, what I saw on reddit in ‘The Office’ Characters’ Most Distinguishing Words.
The person who posted that analysis didn’t include a link to their code, which makes reproducing it pretty difficult. Here’s what I came up with on my own in R:
library(dplyr)
library(tidytext)  # provides the stop_words data frame

tidyTranscripts %>%
  filter(characterType == 'main') %>%
  anti_join(stop_words) %>%
  # how many times each character says each word
  count(character, word, sort = TRUE) %>%
  # how many times anyone says the word
  group_by(word) %>%
  mutate(totalTimesSaid = sum(n)) %>%
  ungroup() %>%
  mutate(percentage = n / totalTimesSaid * n) %>%
  # distinct words counted per character vs. overall
  group_by(character) %>%
  mutate(characterSpoken = n()) %>%
  ungroup() %>%
  mutate(anyoneSpoken = n(),
         uniquenessOfWord = percentage * anyoneSpoken / characterSpoken) %>%
  # credit each word to the one character who says it most distinctively
  group_by(word) %>%
  top_n(1, uniquenessOfWord) %>%
  ungroup() %>%
  # then keep each character's five most distinguishing words
  group_by(character) %>%
  top_n(5, uniquenessOfWord) %>%
  ungroup() %>%
  select(character, word, uniquenessOfWord) %>%
  arrange(character, desc(uniquenessOfWord))
Which gives us:
| character | word | uniquenessOfWord |
|---|---|---|
| Daphne | crane | 1943.0577 |
| Daphne | dr | 1857.6753 |
| Daphne | mum | 479.5264 |
| Daphne | bloody | 238.4252 |
| Daphne | nice | 168.8845 |
| Frasier | niles | 4413.9584 |
| Frasier | dad | 3442.1692 |
| Frasier | roz | 2905.5846 |
| Frasier | god | 1102.0595 |
| Frasier | time | 861.8832 |
| Martin | yeah | 2141.1908 |
| Martin | hey | 1785.0858 |
| Martin | fras | 853.3099 |
| Martin | eddie | 770.8073 |
| Martin | guys | 625.8700 |
| Niles | frasier | 1445.4581 |
| Niles | maris | 762.1647 |
| Niles | wait | 227.8986 |
| Niles | mel | 136.2381 |
| Niles | maris’s | 120.6549 |
| Roz | alice | 348.1129 |
| Roz | line | 217.9855 |
| Roz | martin | 135.2960 |
| Roz | roger | 125.1669 |
| Roz | weird | 102.8421 |
Now this is pretty close to the original reddit analysis, but it was a bit of a challenge since some characters share the same words. For example, Frasier and Niles both say “dad” a lot, though Frasier says it about twice as often as Niles. To handle that, I credited each word to the single character who says it most distinctively. Daphne is another case: she says “Dr. Crane” a lot, referring to both brothers, but that’s hard to see with unigrams rather than bigrams. You might consider removing one reference or the other.
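If I revisit this, a bigram pass would make the “Dr. Crane” problem easier to spot. Here’s a minimal sketch, assuming a data frame of raw, untokenized lines with a text column; rawTranscripts and its column names are hypothetical stand-ins, not objects from my actual scripts:

```r
library(dplyr)
library(tidytext)

rawTranscripts %>%  # hypothetical: one row per spoken line, with `character` and `text`
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(bigram == "dr crane") %>%
  count(character, sort = TRUE)
```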
There’s still so much more that can be done with this data; I’m just going to work on it in small chunks as I go.