Home > Enterprise >  R: Extract first element of a list within a df into new reduced df
R: Extract first element of a list within a df into new reduced df

Time:10-13

I have some raw data from an instrument that I need to get into shape, so I can work with it. I guess it's an easy fix, but I need some help.

The df looks like this:

date <-c("Date1","Date2","Date3")

data <-c("1,234,567,2345;2,345,5677,256;3,576,345,3456", "1,564,567,2345;2,745,5677,256;3,577,345,8456", "1,234,567,2345;2,345,5677,256,;3,555,345,3456;....")

df<-data.frame(date, data)

So for every date there are size classes (1,2 or 3..) with three corresponding measurements/values, separated by a ",". The different size classes are separated by a ";". Now i would like to access the first value per size class and transfer it into a new df. So that should look like:

date<-c("Date1","Date2","Date3")
data_sizeclass_1<-c(234,564,234) 
data_sizeclass_2<-c(345,745,345) 
data_sizeclass_3<-c(576,577,555) 

df<-data.frame(date,data_sizeclass_1,data_sizeclass_2,data_sizeclass_3)

I hope that it makes sense.

So far I managed to separate the data column into separate columns (with cSplit). I would be able to assemble the columns into a new df, but then I would have to choose each column manually, and since I have more than 200 size classes that would be a lot of work. I would like to find a solution to directly extract only one of the measurements into a new df. Thanks for your help.

CodePudding user response:

library(data.table)

#set df to data.table
dt <- as.data.table(df)

#find the amount of columns which need to be created (max nr of different sizeclasses)
colnr <- max(lengths(strsplit(df$data, ";")))

#create columns for each sizeclass
for(i in 1:colnr ){
  dt[, paste0("data_sizeclass_", i) := sapply(strsplit(data, ";"), `[`, i)]
}

dt

    date                                           data data_sizeclass_1 data_sizeclass_2 data_sizeclass_3
1: Date1  1,234,567,2345;2,345,5677,256;3,576,345,3456;   1,234,567,2345   2,345,5677,256   3,576,345,3456
2: Date2  1,564,567,2345;2,745,5677,256;3,577,345,8456;   1,564,567,2345   2,745,5677,256   3,577,345,8456
3: Date3 1,234,567,2345;2,345,5677,256,;3,555,345,3456;   1,234,567,2345  2,345,5677,256,   3,555,345,3456

In the second part we use strsplit to extract the second element of data

#pivot longer
dt <- melt(dt, measure.vars = paste0("data_sizeclass_", 1:colnr))

#extract second element of strsplit, using , as separator
dt[, value := sapply(strsplit(value, ","), `[`, 2)]

#pivot wider (to original shape)
dt <- dcast(dt, date data ~ variable, value.var = "value")
dt

    date                                           data data_sizeclass_1 data_sizeclass_2 data_sizeclass_3
1: Date1  1,234,567,2345;2,345,5677,256;3,576,345,3456;              234              345              576
2: Date2  1,564,567,2345;2,745,5677,256;3,577,345,8456;              564              745              577
3: Date3 1,234,567,2345;2,345,5677,256,;3,555,345,3456;              234              345              555

CodePudding user response:

Maybe you can try the base R code below

with(
  df,
  cbind(
    df,
    `colnames<-`(
      t(sapply(
        strsplit(data, ";"),
        function(x) as.numeric(regmatches(x, regexpr("(?<=,)\\d ", x, perl = TRUE)))
      )),
      paste0("data_sizeclass_", seq(nrow(df)))
    )
  )
)

which gives

   date                                               data data_sizeclass_1
1 Date1       1,234,567,2345;2,345,5677,256;3,576,345,3456              234
2 Date2       1,564,567,2345;2,745,5677,256;3,577,345,8456              564
3 Date3 1,234,567,2345;2,345,5677,256,;3,555,345,3456;....              234
  data_sizeclass_2 data_sizeclass_3
1              345              576
2              745              577
3              345              555

CodePudding user response:

Does this work?

library(tidyr)
library(dplyr)
library(stringr)
df %>%
  # first get rid of digit at string beginning and, respectively, after semi-colon:
  mutate(data = gsub("^\\d,|(?<=;)\\d,", "", data, perl = TRUE)) %>%
  # separate into columns using semi-colon as separator:
  separate(data, 
           into = c("data_sizeclass_1", "data_sizeclass_2", "data_sizeclass_3"), 
           sep = ";",
           convert = TRUE) %>%
  # finally extract the initial numbers in the new columns:
  mutate(across(c(-date), ~ str_extract(., "\\d ")))
   date data_sizeclass_1 data_sizeclass_2 data_sizeclass_3
1 Date1              234              345              576
2 Date2              564              745              577
3 Date3              234              345              555
  •  Tags:  
  • r
  • Related