I have multiple same format csv files that I need to combine but before that
- Header is not the first row but 4th row. Should I remove first 3 row by skip? Or should I reassign the header?
- I need to add in a column which is the ID of the file (same as file name) before I combine.
- Then I need to extract only 4 columns from a total of 7.
- Sum up numbers under a category.
- Combine all csv files into one.
This is what I have so far where I do Step 1, 3, 4 then only 2 to add in a column then 5, not sure if I should add in the ID column first or not?
files = list.files(pattern = "*.csv", full.names = TRUE)
library("tidyverse")
library("dplyr")
data = data.frame()
for (file in files){
temp <- read.csv(file, skip=3, header = TRUE)
colnames(temp) <- c("Volume", "Unit", "Category", "Surpass Object", "Time", "ID")
temp <- temp [, c("Volume", "Category", "Surpass Object")]
temp <- subset(temp, Category =="Surface")
mutate(id = file)
aggregate(temp$Volume, by=list(Category=temp$Category), FUN=sum)
}
And I got an error:
Error in is.data.frame(.data) :
argument ".data" is missing, with no default
The code is fine if I didn't put in the mutate line so I think the main problem comes from there but any advice will be appreciated.
I am quite new to R and really appreciate all the comments that I can get here.
Thanks in advance!
CodePudding user response:
Since you appear to be trying to use dplyr
, I'll stick with that theme.
library(dplyr)
library(purrr)
files = list.files(pattern = "*.csv", full.names = TRUE)
results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip=3, header=TRUE), .id = "filename") %>%
select(whatever, your, four_columns, are) %>%
group_by(filename, Category) %>%
summarize(Volume = sum(Volume))
Walk-through:
purrr::map_dfr
iterates our function (read.csv(...)
) over each of the inputs (each file infiles
) and row-concatenates it. Since we named the files with themselves (setNames(nm=files)
is akin tonames(files) <- files
), we can useid="filename"
which adds a "filename" column that reflects from which file each row was taken.select(...)
whatever four columns you said you needed. Frankly, since you're aggregating, we really only needc("filename", "Category", "Volume")
, anything else and you likely have missed something in your explanation.group_by(..)
will allow us to get one row for each filename, eachCategory
, whereVolume
is a sum (calculated in the next step,summarize
).
CodePudding user response:
You can use read.csv()
, but if there are many files, I suggest using the fread()
from the data.table
package. It is significantly faster. I used fread()
here, but it will still work if you switch it out for read.csv()
. fread()
is more advanced, as well. You will find that even things like skip
can sometimes be left out, and it will still be read correctly.
library(tidyverse)
library(data.table)
add_filename <- function(flnm){
fread(flnm, skip = 3) %>% # read file
mutate(id = basename(flnm)) # creates new col id w/ basename of the file
}
# single data frame all CSVs; id in first col
df <- list.files(pattern = "*.csv", full.names = TRUE) %>%
map_df(~add_filename) %>%
select(id, Volume, Category, `Surpass Object`)
I get the impression that you wanted to aggregate but keep the consolidated data frame, as well. If that's the case, you'll keep the aggregation separate from building the data frame.
df %>% # not assigned to a new object, so only shown in console
filter(Category == "Surface") %>% # filter for the category desired
{sum(.$Volume)} # sum the remaining values for volume
If you are not aware, the period in that last call is the data carried forward, so in this case, the filtered data. The simplest way (perhaps not the best way) to explain the {} is that sum()
is not designed to handle data frames - therefore isn't inherently friendly with dyplr
piping.
If you wanted the sum of volume for every category instead of only "Surface"
that you had coded in your question, then you would use this instead:
df %>%
group_by(Category) %>%
summarise(sum(Volume))
Notice I used the British spelling of summarize here. The function summarize()
is in a lot of packages. I have just found it easier to use the British spelling for this function whenever I want to make sure it's the dplyr
function that I've called. (tidyverse
accepts the American and British spelling for nearly all functions, I think.)