How can I get the average for a column from multiple files in R?

Time:07-02

I am very new to R and this is probably not a difficult problem to solve, but I have been going around and around and can't get what I need, so I would be very grateful if someone could give me some advice. I've also never asked a question on one of these forums before, so I apologize if I am not following all of the normal conventions for posting.

I have multiple output files from another program that I am trying to do some analysis with using R. The number of output files will not be known in advance. I read them into my R code and store them in the variable listFinal.data.

I am trying to loop through the output files, group each one by the values in the column Entity.Type, and count the number of occurrences of each entity type. I then need the average number of occurrences for each entity type across all of the output files.

Here is a snippet of the column I need to work with in the output files:

ID Entity.Type
1 Ground
2 Ground
3 Air
4 Air
5 Sea
6 Ground
7 Sea
8 Ground
9 Air
10 Ground

The result I am looking for from this single file would be:

Ground Air Sea
5 3 2

I can do this successfully for just one file, but when I use the code that I have written and I have multiple files, I get a result like above for each file when what I really want is a single result like above that is the average across all files.

Here is the code that I am using:

library(dplyr)

listVeh.data <- vector("list", length(listFinal.data))  # pre-allocate one slot per file
for (h in seq_along(listFinal.data)) {  # listFinal.data is all the output files from another program
  listVeh.data[[h]] <- listFinal.data[[h]] %>%
    filter(Entity.Type != "Lifeform") %>%  # remove people, just count vehicles
    group_by(Entity.Type) %>%
    summarize(n = n())
}

CodePudding user response:

Here's a toy example, where you have written the output data as a list:

set.seed(4)
d1 <- data.frame(ID = 1:30,
                 Entity.Type = sample(c("Ground", "Air", "Sea"), 30, replace=TRUE))
d2 <- data.frame(ID = 1:30,
                 Entity.Type = sample(c("Ground", "Air", "Sea"), 30, replace=TRUE))

datlist <- list(d1, d2)
names(datlist) <- c("d1", "d2")

I prefer ldply over do.call(rbind, lapply(...)) because, for a named list, it adds each element's name as an .id column, so every row stays labelled with the file it came from.

library(dplyr)  # provides %>%, group_by(), summarise() and n()
output <- plyr::ldply(datlist, function(x) x %>% group_by(Entity.Type) %>% summarise(n = n()))

  .id Entity.Type  n
1  d1         Air  9
2  d1      Ground  9
3  d1         Sea 12
4  d2         Air 14
5  d2      Ground  9
6  d2         Sea  7
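
If you would rather not depend on plyr, the same stacked table can probably be built with dplyr alone. This is only a sketch: it uses a small hand-written list (toylist, a made-up name, not the sampled data above) so the counts are easy to check, and the column name "file" is my own choice.

```r
library(dplyr)

# Hypothetical toy list with fixed values, so the counts are known in advance.
toylist <- list(
  d1 = data.frame(ID = 1:4, Entity.Type = c("Ground", "Ground", "Air", "Sea")),
  d2 = data.frame(ID = 1:3, Entity.Type = c("Air", "Air", "Ground"))
)

# count() tallies rows per Entity.Type into a column named n;
# bind_rows() stacks the per-file tables and .id stores each list name in "file".
counts <- bind_rows(
  lapply(toylist, function(x) count(x, Entity.Type)),
  .id = "file"
)
```

The result has the same shape as the ldply output, with a "file" column in place of .id.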

With the counts for every file in one data frame, averaging per entity type across all files is straightforward:

output %>% group_by(Entity.Type) %>% summarise(mean(n))

# A tibble: 3 x 2
  Entity.Type `mean(n)`
  <chr>           <dbl>
1 Air              11.5
2 Ground            9  
3 Sea               9.5
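
One thing to watch: mean(n) only averages over the files in which an entity type actually appears. If a type is missing from some file and you want that file to count as zero, dividing the summed counts by the number of files is one way to do it. The sketch below assumes that is the behaviour you want, and uses a small made-up counts table (the names counts, file and mean_n are illustrative) rather than the output shown above.

```r
library(dplyr)

# Hypothetical per-file counts: "Sea" does not occur in file d2.
counts <- data.frame(
  file = c("d1", "d1", "d1", "d2", "d2"),
  Entity.Type = c("Air", "Ground", "Sea", "Air", "Ground"),
  n = c(1, 2, 1, 2, 1)
)

# Total number of files, including those where a type is absent.
n_files <- n_distinct(counts$file)

# Dividing by n_files treats a missing type as a count of zero in that file.
avg <- counts %>%
  group_by(Entity.Type) %>%
  summarise(mean_n = sum(n) / n_files)
```

Here Sea averages 0.5 rather than the 1 that mean(n) would report, because file d2 contributes a zero.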