Need advice on using R to clean up data-CodePudding

I have multiple same format csv files that I need to combine but before that

Header is not the first row but 4th row. Should I remove first 3 row by skip? Or should I reassign the header?
I need to add in a column which is the ID of the file (same as file name) before I combine.
Then I need to extract only 4 columns from a total of 7.
Sum up numbers under a category.
Combine all csv files into one.

This is what I have so far where I do Step 1, 3, 4 then only 2 to add in a column then 5, not sure if I should add in the ID column first or not?

files = list.files(pattern = "*.csv", full.names = TRUE)

library("tidyverse")
library("dplyr")

data = data.frame()

for (file in files){
    temp <- read.csv(file, skip=3, header = TRUE)
    colnames(temp) <- c("Volume", "Unit", "Category", "Surpass Object", "Time", "ID")
    temp <- temp [, c("Volume", "Category", "Surpass Object")]
    temp <- subset(temp, Category =="Surface")
    mutate(id = file)
    aggregate(temp$Volume, by=list(Category=temp$Category), FUN=sum)
    
}

And I got an error:

Error in is.data.frame(.data) : 
  argument ".data" is missing, with no default

The code is fine if I didn't put in the mutate line so I think the main problem comes from there but any advice will be appreciated.

I am quite new to R and really appreciate all the comments that I can get here.

Thanks in advance!

CodePudding user response：

Since you appear to be trying to use dplyr, I'll stick with that theme.

library(dplyr)
library(purrr)
files = list.files(pattern = "*.csv", full.names = TRUE)
results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip=3, header=TRUE), .id = "filename") %>%
  select(whatever, your, four_columns, are) %>%
  group_by(filename, Category) %>%
  summarize(Volume = sum(Volume))

Walk-through:

purrr::map_dfr iterates our function (read.csv(...)) over each of the inputs (each file in files) and row-concatenates it. Since we named the files with themselves (setNames(nm=files) is akin to names(files) <- files), we can use id="filename" which adds a "filename" column that reflects from which file each row was taken.
select(...) whatever four columns you said you needed. Frankly, since you're aggregating, we really only need c("filename", "Category", "Volume"), anything else and you likely have missed something in your explanation.
group_by(..) will allow us to get one row for each filename, each Category, where Volume is a sum (calculated in the next step, summarize).

CodePudding user response：

You can use read.csv(), but if there are many files, I suggest using the fread() from the data.table package. It is significantly faster. I used fread() here, but it will still work if you switch it out for read.csv(). fread() is more advanced, as well. You will find that even things like skip can sometimes be left out, and it will still be read correctly.

library(tidyverse)
library(data.table)

add_filename <- function(flnm){
    fread(flnm, skip = 3) %>%   # read file
    mutate(id = basename(flnm)) # creates new col id w/ basename of the file 
}

# single data frame all CSVs; id in first col
df <- list.files(pattern = "*.csv", full.names = TRUE) %>%
    map_df(~add_filename) %>%
    select(id, Volume, Category, `Surpass Object`)

I get the impression that you wanted to aggregate but keep the consolidated data frame, as well. If that's the case, you'll keep the aggregation separate from building the data frame.

df %>%       # not assigned to a new object, so only shown in console
    filter(Category == "Surface") %>%  # filter for the category desired
    {sum(.$Volume)}                    # sum the remaining values for volume

If you are not aware, the period in that last call is the data carried forward, so in this case, the filtered data. The simplest way (perhaps not the best way) to explain the {} is that sum() is not designed to handle data frames - therefore isn't inherently friendly with dyplr piping.

If you wanted the sum of volume for every category instead of only "Surface" that you had coded in your question, then you would use this instead:

df %>% 
    group_by(Category) %>%
    summarise(sum(Volume))

Notice I used the British spelling of summarize here. The function summarize() is in a lot of packages. I have just found it easier to use the British spelling for this function whenever I want to make sure it's the dplyr function that I've called. (tidyverse accepts the American and British spelling for nearly all functions, I think.)