Use part of row data for new columns in R-CodePudding

I have a very large df with a column that contains the file directory for each row's data.

Example: D:Mouse_2174/experiment/13/trialsummary.txt.1

I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.

I've used sub as recommended here (match part of names in data.frame to new column), but only can get the subject column to say "D:Mouse_2174" I've added an additional line and can get it down to "D:Mous2174"

Is there a way to eliminate all chars before _ and after / to obtain mouse ID? For session number, I'm not quite as sure what to do with multiple / in the directory name.

percent_correct_list$mouse_id <- sub("/. ", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub(" ._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174

Here is sample code for the directories:

df <- data.frame(
             rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
                    "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
                    "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
                    "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)

What I want:

rn	id	session
D:..	2174	9
D:..	2181	33
D:..	2183	107
D:..	2185	87

Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)

CodePudding user response：

Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.

It does assume that all values of rn take the same format.

library(dplyr)
library(tidyr)

new_df <- df %>% 
  # separate on / into 4 new columns
  separate(rn, into = c(paste0("item", 1:4)), sep = "/", remove = FALSE) %>%
  # remove unwanted columns
  select(-item2, -item4) %>% 
  # separate again on _ into 2 new columns
  separate(item1, sep = "_", into = c("prefix", "id")) %>%
  # retain and rename desired columns 
  select(rn, id, session = item3)

Result:

                                                 rn   id session
1   D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174       9
2  D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181      33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183     107
4  D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185      87

CodePudding user response：

For sure isnt an elegant solution. Only works if your ID and Session are always numbers...

df <- data.frame(
            rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
                   "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
                   "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
                   "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
            # Extract all numeric values from the string
            mutate(allnums = regmatches(rn, gregexpr(" [[:digit:]] ", rn)))%>%
            # Separate them
            separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
            # Extract them individually
            mutate(id = as.numeric(regmatches(id, gregexpr(" [[:digit:]] ", id,))),
                   session = as.numeric(regmatches(session, gregexpr(" [[:digit:]] ", session)))) %>%
            select(-idk)

Output:

1  D:Mouse_2174/iti_intervals/9/trialsummary.txt.1   2174       9
2  D:Mouse_2181/iti_intervals/33/trialsummary.txt.1  2181      33
3  D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183     107
4  D:Mouse_2185/iti_intervals/87/trialsummary.txt.1  2185      87