Home > Back-end >  Split a dataframe into smaller ones based on their value types
Split a dataframe into smaller ones based on their value types

Time:10-18

I am writing a script to clean a very long data frame where I have a key-value pair in two columns named parameter and value, where parameter is a factor with multiple levels, and value can be either a date, a number, or a string.

In the example below I use as.numeric(), lubridate::dmy_hms() and use the NAs they generate to understand what is not numeric and not a date. This somehow allows me to generate different lists based on the value type and then split them based on their parameter type. I guess the rexprex is more explanatory than my description.

The issue is this approach seems slow, convoluted, and somehow patchy. Is there a better way to do this?

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- data.table::data.table(
  value = c(
    "13.249",
    "NN",
    "13.763",
    "13.665",
    "09/Mar/2022 11:05:12",
    "12.276",
    "13.405",
    "13.684",
    "13.744",
    "14.201"
  ),
  parameter = as.factor(
    c(
      "volume",
      "signature",
      "volume",
      "volume",
      "inoculated",
      "volume",
      "volume",
      "volume",
      "volume",
      "volume"
    )
  )
)

# Finds the numeric part of the data frame and separates it from the rest
vector_numeric_df <- df %>% 
  mutate(value = as.numeric(value)) %>% 
  filter(!is.na(value)) %>%
  group_by(parameter) %>% 
  summarise(n = n()) %>%
  select(parameter) %>% 
  pull() %>% 
  as.vector()
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

vector_non_numeric_df <- df %>% 
  mutate(value = as.numeric(value)) %>% 
  filter(is.na(value)) %>%
  group_by(parameter) %>% 
  summarise(n = n()) %>%
  select(parameter) %>% 
  pull() %>% 
  as.vector()
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

numeric_df <- df %>%
  filter(parameter %in% vector_numeric_df) %>% 
  mutate(value = as.numeric(value))

numeric_df_list <- numeric_df %>%
  group_split(parameter) %>%
  setNames(sort(unique(numeric_df$parameter)))

# Finds the date part of the data frame and separates it from the rest
vector_date_df <- df %>%
  filter(parameter %in% vector_non_numeric_df) %>% 
  mutate(value = dmy_hms(value)) %>%
  filter(!is.na(value)) %>%
  group_by(parameter) %>% 
  summarise(n = n()) %>%
  select(parameter) %>% 
  pull() %>% 
  as.vector()
#> Warning: 1 failed to parse.

date_df <- df %>%
  filter(parameter %in% vector_date_df) %>% 
  mutate(value = dmy_hms(value))

date_df_list <- date_df %>%
  group_split(parameter) %>%
  setNames(sort(unique(date_df$parameter)))

# Makes everything else a character
vector_char_df <- df %>%
  filter(!parameter %in% vector_numeric_df,
         !parameter %in% vector_date_df) %>% 
  group_by(parameter) %>% 
  summarise(n = n()) %>%
  select(parameter) %>% 
  pull() %>% 
  as.vector()

char_df <- df %>%
  filter(parameter %in% vector_char_df) %>% 
  mutate(value = as.character(value))

char_df_list <- char_df %>%
  group_split(parameter) %>%
  setNames(sort(unique(char_df$parameter)))

# Generates all the objects

list2env(numeric_df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
list2env(date_df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
list2env(char_df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

str(inoculated)
#> tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
#>  $ value    : POSIXct[1:1], format: "2022-03-09 11:05:12"
#>  $ parameter: Factor w/ 3 levels "inoculated","signature",..: 1
str(signature)
#> tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
#>  $ value    : chr "NN"
#>  $ parameter: Factor w/ 3 levels "inoculated","signature",..: 2
str(volume)
#> tibble [8 x 2] (S3: tbl_df/tbl/data.frame)
#>  $ value    : num [1:8] 13.2 13.8 13.7 12.3 13.4 ...
#>  $ parameter: Factor w/ 3 levels "inoculated","signature",..: 3 3 3 3 3 3 3 3

Created on 2022-10-17 with reprex v2.0.2

CodePudding user response:

Since each parameter has a different data type, we can split() by parameter and then use parse_guess() to do most of the work. Your date data are non-standard and need a little special handling. Lastly, I usually try to avoid dumping related data into separate environment objects, but since it's apparently what you want, here's how to do it:

library(tidyverse)
library(lubridate)

# split the data frame by 'parameter' and attempt to guess each value format
df_split <- split(df, df$parameter) %>% 
  lapply(\(x) x %>% mutate(value = parse_guess(value)))

# the date data have a non-standard format and need special handling
df_split$inoculated <- df_split$inoculated %>% 
  mutate(value = dmy_hms(value))

# place list objects in global environment: not good practice but it seems to be what you want
list2env(df_split, envir = .GlobalEnv)

CodePudding user response:

As jdobres points out, it seems you can simply do split by parameter. If you still want to split by data type try this:

df_type <- suppressWarnings(ifelse(
    !is.na(as.numeric(df$value)),"volume",ifelse(
        !is.na(dmy_hms(df$value)),"inoculated","signature")))

l <- split(df,df_type)

l$inoculated
l$signature
l$volume

CodePudding user response:

Others have pointed out that split does what you want in terms of creating the separate data frames.

Assuming you don't want to use df$parameter, or you can't trust it, you can just parse df$value with your as. functions to create the split factor:

split(df, 
  is.na(suppressWarnings(as.numeric(df$value)))        #  1 if not numeric
  2 * is.na(as.Date(df$value, "%d/%b/%Y %H:%m:%s"))    #  2 if not date
)
$`1`
                 value  parameter
5 09/Mar/2022 11:05:12 inoculated

$`2`
    value parameter
1  13.249    volume
3  13.763    volume
4  13.665    volume
6  12.276    volume
7  13.405    volume
8  13.684    volume
9  13.744    volume
10 14.201    volume

$`3`
  value parameter
2    NN signature

A small technical note, if there are any NAs in the df$value column they will be dropped (because split doesn't like them, and your parsing will return NAs)

  • Related