How do I find the most common words in a character vector in R?

Time:02-27

I am analysing some fMRI data. In particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:

library(httr)
scrape_and_sort = function(neurosynth_link){
  result = content(GET(neurosynth_link), "parsed")$data
  names  = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
  df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
  df$z_score = as.numeric(df$z_score)
  df = df[order(-df$z_score), ]
  df = df[df$z_score >= 3, ]  # safer than -which(): -which() drops every row when nothing matches
  df = na.omit(df)
  return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')

Now, I want to know which keywords come up most often, and ideally to construct a list of the most common words. I tried the following:

sort(table(RO4$Name),decreasing=TRUE)

But this clearly won't work. The problem is that the names (for example: "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.

But I am not sure how to search inside each string and record individual words like that. Any ideas?

CodePudding user response:

Not sure I understand. Can't you proceed like this?

x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex"   "auditory" "auditory" "hello"    "friend"

CodePudding user response:

using packages {jsonlite}, {dplyr} and {tidyr} (which provides separate_rows()), plus the pipe operator %>% for legibility:

  1. store the response as dataframe df
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
  2. reshape and aggregate
df %>%
    ## keep first column only and name it 'keywords':
    select('keywords' = 1) %>%
    ## multiple cell values (as separated by a blank)
    ## into separate rows:
    separate_rows(keywords, sep = " ") %>%
    group_by(keywords) %>%
    summarise(count = n()) %>%
    arrange(desc(count))

result:

  # A tibble: 965 x 2
   keywords count
   <chr>    <int>
 1 cortex      53
 2 gyrus       26
 3 temporal    26
 4 parietal    23
 5 task        22
 6 anterior    19
 7 frontal     18
 8 visual      17
 9 memory      16
10 motor       16
# ... with 955 more rows
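As an aside, the group_by() %>% summarise() %>% arrange() chain can be collapsed into a single dplyr::count() call, which does all three. A minimal sketch on a toy keywords column (made-up values, standing in for the scraped data):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the scraped keywords column:
df <- tibble(keywords = c("auditory cortex", "auditory", "primary auditory cortex"))

df %>%
    separate_rows(keywords, sep = " ") %>%
    count(keywords, sort = TRUE)   # same as group_by + summarise(n()) + arrange(desc(n))
```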

edit: or, if you want to proceed from your dataframe

RO4 %>%
    select(Name) %>%
    ## or: select(everything())
    ## or: select(Name:func_con)
    separate_rows(Name, sep = ' ')
    ## ... then group_by/summarise/arrange as above

You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that the values of the other variables will be repeated across as many rows as are needed to accommodate any multi-word value in column "Name", so this will introduce some redundancy.
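To see that redundancy concretely, here is a toy two-column frame (the values are made up): separate_rows() repeats the z_score for every word split out of Name.

```r
library(tidyr)

# Hypothetical Name/z_score pairs:
df <- data.frame(Name    = c("auditory cortex", "memory"),
                 z_score = c(7.1, 4.2))

separate_rows(df, Name, sep = " ")
# 'auditory' and 'cortex' each carry z_score 7.1, giving 3 rows total
```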

If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:

RO4 %>%
    filter(z_score >= 3, !is.na(z_score)) %>%
    arrange(desc(z_score))