Home > OS >  extracting part of filename and add them as value in column
extracting part of filename and add them as value in column

Time:04-26

I have imported multiple excel files at once (more then 20) in R and I want to give values to columns in the dataframe based on the file name.

Let us say that the filename are the following: "A1_412", "A1_413" , "A1_415" , "A2_412", "A2_413", "A2_415" , "A23_413" "A4_5", "A6_413_TS"... etc

I want to create two column where the value in column_1 will show A1, A2, A23 if these row are from the file A1_412, A1_413 or A1_415

the value in column_2 will show 412 or 413 or 415 based on the name of the file

I have already imported all the files

#import all the excel files
file.list <- list.files(path = "xxxxxxxxxx",    
                       pattern = "*.xlsx",
                       full.names = TRUE) 

alldata<-file.list %>%
  map_dfr(~read_excel(.x)%>%
  mutate_all(as.character))

this is an example of the end result that im searching for:

G     B column_1 column_2
a  Lion       A1      415
b   Cat       A1      412
c Horse      A23      413
d   dog       A2      415
e Mouse      A23      413
f Snake       A2      412
g zebra       A6      413
h bird         

bird is coming from the file A4_5, I want the column_1 to be empty instead of showing A4 and column_2 to be empty instead of showing 5 Zebra is coming from file A6_413_TS so here I want column_1 to show A6 and column_2 to only show 413 zebra is coming

I need a solution that can be generalized to a large number of variables.

Any suggestions on how to do this?

CodePudding user response:

file.list %>%
  set_names(.) %>%
  map_df(~mutate_all(read_excel(.x), as.character), .id = 'grp') %>%
  mutate(grp = str_remove(basename(grp), ".xlsx")) %>%
  separate(grp, c('col1', 'col2'), sep = '_', extra = 'merge')

CodePudding user response:

You could try something like this. Basically, I use lapply() to loop over the filenames, each time using basename() to get the base file name, and then use str_remove() to remove the extension, and strsplit to split on the _ between the prefix and the suffix

bind_rows(
  lapply(file.list, function(f) {
    bn = basename(f)
    colvals = strsplit(str_remove(bn, ".xlsx"), "_")[[1]]
    read_excel(f) %>% mutate_all(as.character()) %>% mutate(col1 = colvals[1], col2=colvals[2])
  })
)

  •  Tags:  
  • r
  • Related