I'm having problems in generalising a method of dealing with some 'problem data' - vectorised elements of a list. (I'm not entirely sure if I'm describing this accurately so please forgive my ignorance).
Objective:
I'm trying to compile a function that deals with the following problem automatically as the intention is to run / utilise this methodology repeatedly.
Problem:
I've pulled some sensor data from a db by a unit name (equipment the sensor is from), usually the sensor data comes as a vector (as per the desired outcome - below) but for certain units (where the data collectors are configured differently) it returns a list of vectors.
The number of elements within the vector relate to the number of units on site (i.e. if there are two units on site, /01 and /02, then the vector will contain 2 elements....and so on)
I want only the sensor data that pertains to that unit to be kept in the relevant column.
Not all units are configured in the same fashion, so I was only intending on running a a function when the class of the column has been confirmed that it is a list.
The number of sensors and their tags will be different each time (input into the main function).
Intended methodology:
- Perform following checks and corrective measures over each column
- Check the column class - check to see if class is a 'list'
- If FALSE, do nothing.
- If TRUE, perform following over each row with 'unlist_func'(sub function):
- Check the end of the text string on each row of 'unit' column and extract unit_no
- Unlist the current element and select the number in the vector that corresponds with the unit_no in the unit column
- Ammend the column class to
Reproducible problem:
The following tibble is an example of what I'm working with:
unit <- c('a2b7/01', 'a2b7/02', 'a2b7/03', 'a2b7/01', 'a2b7/02', 'a2b7/03', 'a2b7/01', 'a2b7/02', 'a2b7/03')
sen1 <- list(c(1,2,3), c(4,5,6), c(7,8,9), c(10,11,12), c(13,14,15), c(16,17,18), c(19,20,21), c(22,23,24), c(25,26,27))
sen2 <- list(1, 2, 3, 4, 5, 6, 7, 8, 9)
sensor_data <- tibble(unit, sen1, sen2)
Desired Outcome:
unit <- c('a2b7/01', 'a2b7/02', 'a2b7/03', 'a2b7/01', 'a2b7/02', 'a2b7/03', 'a2b7/01', 'a2b7/02', 'a2b7/03')
sen1 <- c(1,5,9,10,14,19,23,27)
sen2 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
sensor_data_new <- tibble(unit, sen1, sen2)
Attempt Thus far and Request:
So as not to turn up empty handed I've posted up a hasty version of my intended methodology but my apologies, it's a work in progress and I appreciate there are errors within, but I was hoping to appeal to those out there to help if possible or suggest a different approach.
All advice would be greatly appreciated.
function(unit, sensors, date_from, date_to){
# ------ QUERYING TASK ------------------------
sensor_data <- 'pulls sensor info from unit names'
# ------ SUB FUNCTION - Unlist function -------
unlist_func <- function(df = sensor_data, j){
sensor_data_distinct <- df %>%
mutate(unit_no = str_extract(unit, '\\d$')) %>%
select(unit_no) %>%
distinct()
for (i in 1:nrow(df))
{
if(length(df[, j[[i]] ]) => 1 & str_ends(df$unit, sensor_data_distinct$unit_no)){
unlist(df[i, j])
df[sensor_data_distinct$unit_no]
} else(sensor_data[, j])
}
# ------- STAGE 1 CLEANSING FUNCTION -------------
# unit will always be in column 1 and datetime always in column 2
stg1_cleanse <- for(j in 3:ncol(sensor_data)) { # for-loop over columns
if (is.list(sensor_data[, j]) == TRUE){
lapply(unlist_func.....)
}
}
}
# ------- STAGE 1 CLEANSING FUNCTION -------------
stg2_cleanse <- 'further cleansing routine'
}
CodePudding user response:
I think you can greatly simplify your function. This appears to produce the result you want. The custom function extract_func
implements the rule in which each row is checked for length. If it is a single-element value, the single element is returned; if it is a vector, the element at the requested index is returned instead. We can then use dplyr's rowwise
and across
functions to apply this function to columns "sen1" and "sen2", passing "unit_num" as the index argument (when needed).
library(tidyverse)
unit <- c('a2b7/01', 'a2b7/02', 'a2b7/03', 'a2b7/01', 'a2b7/02', 'a2b7/03', 'a2b7/01', 'a2b7/02', 'a2b7/03')
sen1 <- list(c(1,2,3), c(4,5,6), c(7,8,9), c(10,11,12), c(13,14,15), c(16,17,18), c(19,20,21), c(22,23,24), c(25,26,27))
sen2 <- list(1, 2, 3, 4, 5, 6, 7, 8, 9)
sensor_data <- tibble(unit, sen1, sen2)
extract_func <- function(data, idx) {
# check for NULL data and convert to NA if it is present
if (is.null(data)) {
data <- NA
}
if (length(data) == 1) {
return(data[1])
} else {
return(data[idx])
}
}
sensor_data_clean <- sensor_data %>%
rowwise() %>%
mutate(
unit_num = as.numeric(str_extract(unit, '\\d$')),
across(c(sen1, sen2), ~extract_func(., unit_num), .names = 'extract_{.col}')
)
# A tibble: 9 × 6
# Rowwise:
unit sen1 sen2 unit_num extract_sen1 extract_sen2
<chr> <list> <list> <dbl> <dbl> <dbl>
1 a2b7/01 <dbl [3]> <dbl [1]> 1 1 1
2 a2b7/02 <dbl [3]> <dbl [1]> 2 5 2
3 a2b7/03 <dbl [3]> <dbl [1]> 3 9 3
4 a2b7/01 <dbl [3]> <dbl [1]> 1 10 4
5 a2b7/02 <dbl [3]> <dbl [1]> 2 14 5
6 a2b7/03 <dbl [3]> <dbl [1]> 3 18 6
7 a2b7/01 <dbl [3]> <dbl [1]> 1 19 7
8 a2b7/02 <dbl [3]> <dbl [1]> 2 23 8
9 a2b7/03 <dbl [3]> <dbl [1]> 3 27 9
Since you're just checking the length of the vector and then extracting a single element, you could do the following inline within mutate
(though a predefined custom function like the above gives you a little more flexibility for the future).
sensor_data_clean <- sensor_data %>%
rowwise() %>%
mutate(
unit_num = as.numeric(str_extract(unit, '\\d$')),
across(c(sen1, sen2), ~(.[min(length(.), unit_num)]), .names = 'extract_{.col}')
)