I have output that contains a large number of words. I want to take these words as a list and find their frequency column by column in my data.
For example, this is my output, with frequencies counted across the whole data set:
ich 4
möchte 5
doner 3
und 2
ayran 6
My columns are 2000, 2001, 2002 and 2003. I want to find how many times "ich", "möchte", "doner", "und" and "ayran" occur in each of those columns separately.
Please help me take this output as a list and find the frequencies of its words, as explained above. Please use R.
CodePudding user response:
An alternative way (using the data provided by Kris).
library(data.table)
word_count <- data.table(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))
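# Words whose frequencies we want to count
interesting_words <- c("ich", "möchte", "doner", "und", "ayran")
# Reshape to long format: one row per (year column, word) pair
wc_long <- melt(word_count, measure.vars = c("Y2000", "Y2001", "Y2002", "Y2003"))
# Count each interesting word over all years, most frequent first
wc_long[value %chin% interesting_words, .N, by = value][order(-N)]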
value N
1: ich 5
2: und 4
3: ayran 3
4: doner 2
5: möchte 1
This can be extended to count by year:
wc_long[value %chin% interesting_words, .N, by = .(value, variable)][order(-N)]
value variable N
1: und Y2002 3
2: ich Y2000 2
3: ayran Y2001 2
4: ich Y2003 2
5: und Y2000 1
6: ich Y2001 1
7: möchte Y2001 1
8: doner Y2002 1
9: ayran Y2003 1
10: doner Y2003 1
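If one row per word and one column per year is easier to read, the long counts above can be reshaped with `dcast()`. This is only a minimal sketch, not part of the original answer: the names `counts` and `wide`, and the choice to fill missing word/year combinations with 0, are assumptions made for illustration.

```r
library(data.table)

# Same sample data as above
word_count <- data.table(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
                         Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
                         Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
                         Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))
interesting_words <- c("ich", "möchte", "doner", "und", "ayran")

# Long format: one row per (year column, word) pair
wc_long <- melt(word_count, measure.vars = names(word_count),
                variable.name = "year", value.name = "word")

# Count the interesting words per year
counts <- wc_long[word %chin% interesting_words, .N, by = .(word, year)]

# Wide format: one row per word, one column per year;
# combinations that never occur are filled with 0 (assumption)
wide <- dcast(counts, word ~ year, value.var = "N", fill = 0L)
wide
```

Each row of `wide` then holds the per-year counts for one word, and the row sums reproduce the overall totals shown earlier.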
CodePudding user response:
Your question is not very clear: there is no sample data frame to work with, and the output you require is not clearly stated.
However, you could do something along these lines:
library(dplyr)
word_count <- data.frame(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))
interesting_words <- c("ich", "möchte", "doner", "und", "ayran")
word_count %>%
group_by(Y2000) %>%
summarize(Y2000_count = n()) %>%
filter(Y2000 %in% interesting_words)
word_count %>%
group_by(Y2001) %>%
summarize(Y2001_count = n()) %>%
filter(Y2001 %in% interesting_words)
word_count %>%
group_by(Y2002) %>%
summarize(Y2002_count = n()) %>%
filter(Y2002 %in% interesting_words)
word_count %>%
group_by(Y2003) %>%
summarize(Y2003_count = n()) %>%
filter(Y2003 %in% interesting_words)
Do not hesitate to edit your question to make it clearer.
Taking into account zx8754's comment:
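library(tidyr)
word_count_years <- word_count %>%
  # Long format: one row per (Year, Word) pair
  pivot_longer(starts_with("Y"), names_to = "Year", values_to = "Word") %>%
  group_by(Year, Word) %>%
  summarise(Count = n(), .groups = "drop") %>%
  filter(Word %in% interesting_words) %>%
  # Back to wide format: one column per year; a word absent from a year gets 0 instead of NA
  pivot_wider(names_from = Year, values_from = Count, values_fill = 0L)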
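For comparison, a similar word-by-year table can be sketched in base R without any packages. This is an illustrative alternative rather than part of the answer above; the name `counts` is an assumption. It relies on converting each column to a factor whose levels are the interesting words, so that `table()` reports 0 for words that do not occur and ignores all other words.

```r
# Same sample data as above
word_count <- data.frame(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
                         Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
                         Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
                         Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))
interesting_words <- c("ich", "möchte", "doner", "und", "ayran")

# For each year column, tabulate only the interesting words.
# Words outside the factor levels become NA and are dropped by table().
counts <- sapply(word_count, function(col) {
  table(factor(col, levels = interesting_words))
})

# Matrix with one row per word and one column per year
counts
```

A single value can then be looked up by name, for example `counts["ich", "Y2000"]`, and `rowSums(counts)` gives the totals over all years.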