Find listed specific words in a column [R programming]

Time:11-04

I have an output that contains a large number of words. I want to put these words in a list and find their frequency, column by column, in my data.

For example, this is my output, showing each word's frequency across the whole data set:

ich         4
möchte      5
doner       3
und         2
ayran       6

My columns are 2000, 2001, 2002 and 2003. I want to find how many times "ich", "möchte", "doner", "und" and "ayran" occur in each of those columns separately.

Please help me take this output as a list and find the frequencies per column, as explained above. Please use R.

CodePudding user response:

An alternative way (using the data provided by Kris).

library(data.table)

word_count <- data.table(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
                         Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
                         Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
                         Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))

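# Words whose frequencies we want
interesting_words <- c("ich", "möchte", "doner", "und", "ayran")

# Reshape to long format: one row per (year column, word) pair
wc_long <- melt(word_count, measure.vars = c("Y2000", "Y2001", "Y2002", "Y2003"))

# Keep only the words of interest, count them, and sort by descending count
wc_long[value %chin% interesting_words, .N, by = value][order(-N)]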

    value N
1:    ich 5
2:    und 4
3:  ayran 3
4:  doner 2
5: möchte 1

This can be extended to count by year as well:

wc_long[value %chin% interesting_words, .N, by = .(value, variable)][order(-N)]

     value variable N
 1:    und    Y2002 3
 2:    ich    Y2000 2
 3:  ayran    Y2001 2
 4:    ich    Y2003 2
 5:    und    Y2000 1
 6:    ich    Y2001 1
 7: möchte    Y2001 1
 8:  doner    Y2002 1
 9:  ayran    Y2003 1
10:  doner    Y2003 1
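
If one table with a column per year is preferred, the per-year counts can be cast back to wide format. This is only a minimal sketch: it rebuilds the same sample data so that it runs on its own, and the names `counts` and `wc_wide` are illustrative, not part of the answer above.

```r
library(data.table)

# Same sample data as above (one column per year)
word_count <- data.table(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
                         Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
                         Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
                         Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))

interesting_words <- c("ich", "möchte", "doner", "und", "ayran")

# Long format: one row per (year, word) pair
wc_long <- melt(word_count, measure.vars = names(word_count),
                variable.name = "year", value.name = "word")

# Count each word of interest within each year
counts <- wc_long[word %chin% interesting_words, .N, by = .(word, year)]

# One row per word, one column per year; absent combinations are filled with 0
wc_wide <- dcast(counts, word ~ year, value.var = "N", fill = 0L)
wc_wide
```

A year in which none of the listed words occurs would get no column here, because the filter removes it before casting.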

CodePudding user response:

Your question is not very clear: there is no sample data frame to work with, and the output you require is not clearly stated.

However, you could do something along these lines:

word_count <- data.frame(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
                         Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
                         Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
                         Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))

interesting_words <- c("ich", "möchte", "doner", "und", "ayran")

library(dplyr)

# Count every word in Y2000, then keep only the words of interest
word_count %>%
  group_by(Y2000) %>%
  summarize(Y2000_count = n()) %>%
  filter(Y2000 %in% interesting_words)

word_count %>%
  group_by(Y2001) %>%
  summarize(Y2001_count = n()) %>%
  filter(Y2001 %in% interesting_words)

word_count %>%
  group_by(Y2002) %>%
  summarize(Y2002_count = n()) %>%
  filter(Y2002 %in% interesting_words)

word_count %>%
  group_by(Y2003) %>%
  summarize(Y2003_count = n()) %>%
  filter(Y2003 %in% interesting_words)

Do not hesitate to edit your question to make it clearer.

Taking into account zx8754's comment:

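library(tidyr)  # pivot_longer() and pivot_wider() come from tidyr

word_count_years <- word_count %>%
  pivot_longer(starts_with("Y"), names_to = "Year", values_to = "Word") %>%
  group_by(Year, Word) %>%
  # drop the grouping so the result can be reshaped freely
  summarise(Count = n(), .groups = "drop") %>%
  filter(Word %in% interesting_words) %>%
  # words absent from a year get 0 instead of NA
  pivot_wider(names_from = Year, values_from = Count, values_fill = 0)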
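
For completeness, the same per-column counts can be sketched in base R without any packages. This assumes that the frequency output from the question is available as a named vector, here called `overall_freq` (a hypothetical name, with the values copied from the question), so that the word list can be taken from its names. The sample data is the one used above, so the per-year counts will not add up to the question's totals.

```r
# Same sample data as above (one column per year)
word_count <- data.frame(Y2000 = c("word1", "word2", "ich", "ich", "word5", "und"),
                         Y2001 = c("ich", "möchte", "word3", "ayran", "ayran", "word6"),
                         Y2002 = c("word1", "word2", "und", "und", "doner", "und"),
                         Y2003 = c("ich", "word2", "ayran", "ich", "word5", "doner"))

# Assumption: the existing frequency output is a named vector like this one
overall_freq <- c("ich" = 4, "möchte" = 5, "doner" = 3, "und" = 2, "ayran" = 6)

# Take the words themselves as the list to search for
interesting_words <- names(overall_freq)

# For every column, count how often each word of interest occurs;
# the result is a matrix with one row per word and one column per year
counts <- sapply(word_count, function(col) {
  sapply(interesting_words, function(w) sum(col == w))
})
counts
```

Because each word is counted explicitly, words that never occur in a column show up as 0 rather than being dropped. For instance, `counts["und", "Y2002"]` is 3 and `counts["doner", "Y2000"]` is 0 for this sample data.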