Cut a long column of characters between two commas in an R dataframe

Time:06-29

I'm working in R and would like to trim a column so that it keeps only the text between the 3rd and 4th commas.

Col1<- c("Sample1")
Col2 <- c("1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X")

df <- data.frame(Col1, Col2)
Col1 Col2
Sample1 1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X

With this table, I would like to have:

Col1 Col2
Sample1 Bacillariophyta

My dataset is really big. Does anyone know how I can do this?

CodePudding user response:

You can use sapply with strsplit to split each string on commas and take the 4th element.

df$Col3 <- sapply(df$Col2, function(x)unlist(strsplit(x, ","))[4])

df

#     Col1
#1 Sample1
                                                                                                                                                                             #Col2
#1 #1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X
#               Col3
#1 c:Bacillariophyta
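
Since the question mentions a large dataset, a variant of the same idea may be worth trying. strsplit is itself vectorized, so it can be called once on the whole column, and fixed = TRUE treats the comma as a literal string rather than a regular expression. This is only a sketch and has not been benchmarked; the names parts and Col3 are chosen here for illustration, and the sample row is copied from the question.

```r
# Sample data copied from the question.
df <- data.frame(
  Col1 = "Sample1",
  Col2 = "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X",
  stringsAsFactors = FALSE
)

# Split the whole column in one call; the result is a list with one
# character vector per row.
parts <- strsplit(df$Col2, ",", fixed = TRUE)

# Take the 4th piece of each row. Indexing past the end of a character
# vector gives NA, so rows with fewer than four pieces become NA.
df$Col3 <- vapply(parts, function(p) p[4], character(1))
```

vapply is used instead of sapply so that the result is guaranteed to be a plain character vector with one value per row.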

CodePudding user response:

An alternative would be to use sub:

sub("^(?:[^,]+,){3}([^,]+).*", "\\1", df$Col2) -> df$Col2

# Col1              Col2
# 1 Sample1 c:Bacillariophyta
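
Both answers return c:Bacillariophyta, while the question asks for Bacillariophyta without the rank prefix. Assuming every field has the form prefix:name, as in the sample, the pattern can be extended to skip the prefix as well. The following is a sketch under that assumption; x and name_only are names introduced here for illustration, and perl = TRUE is added here so that the non-capturing group (?:...) is handled by the PCRE engine.

```r
# Sample string copied from the question.
x <- "1A00318:268:H27G3DSX3:4:1101:20989:1047KJ758397.1.1794_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X"

# ^(?:[^,]+,){3}  skip the first three comma-terminated fields
# [^,:]*:         skip the rank prefix of the 4th field (here "c:")
# ([^,]+)         capture the name up to the next comma
# .*              consume the rest so that only the capture remains
name_only <- sub("^(?:[^,]+,){3}[^,:]*:([^,]+).*", "\\1", x, perl = TRUE)
```

If a string does not match the pattern (for example, because it has fewer than four fields), sub returns it unchanged, so it may be worth checking for such rows before overwriting the column.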