Home > Mobile >  How to infer missing values in a R data frame from other rows that have the data?
How to infer missing values in a R data frame from other rows that have the data?

Time:05-05

In my example I have a table with products that have a unique ID in Store 1 and a different (but also unique) ID in Store 2.

As a rule, the same product has always the same ID in store 1 and always the same ID in store 2. i.e. if I know only one of these variables, I should be able to fill in the remaining ones.

However, I don't have a neat reference table or list that shows which Product and IDs belong together. All I have to start with is a table with lots of gaps like this:

Before

I would like to fill the gaps using the information that is already in the table like so:

After

Is there a R function that can do this or how would you go about this?

#Example data
df <- data.frame(
  c(NA, "Shovel", NA, NA, "Gloves", NA),
  c("W06", NA, "W06", "W11", "W11", NA),
  c("EF001", "EF001", NA, NA, "EF004", "EF004")
)
colnames(df) <- c("Product", "Store1_ID", "Store2_ID")


df2 <- data.frame(
  c("Shovel",NA, NA, NA, "Gloves", NA),
  c(NA, "W06", "W06", "W11", "W11", NA),
  c("EF001", "EF001", NA, NA, "EF004", "EF004")
)
colnames(df2) <- c("Product", "Store1_ID", "Store2_ID")

CodePudding user response:

Create a grouping index column, after looping across the 'Store' columns, apply the na.locf0 to replace the NA elements with previous non-NA values, if there are NAs as beginning element, then apply the na.locf0 with fromLast = TRUE on top of the first na.locf0, convert this to logical column with duplicated, negate (!) so that it returns TRUE for first non-duplicated element and FALSE for others, get the cumulative sum (cumsum), use pmax to find the max index elementwise to create the 'grp', then we use fill on the other columns

library(tidyr)
library(dplyr)
library(purrr)
df %>% 
  group_by(grp = invoke(pmax, across(starts_with('Store'), 
    ~ cumsum(!duplicated(zoo::na.locf0(zoo::na.locf0(.x), 
           fromLast = TRUE))), .names = "{.col}_new")))  %>% 
 fill(everything(), .direction = "downup") %>%
 ungroup %>% 
 select(-grp)
# A tibble: 6 × 3
  Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004   
  • The .names is used to create a new column, though in this case, it is not needed

Same output with df2

df2 %>% 
  group_by(grp = invoke(pmax, across(starts_with('Store'), 
    ~ cumsum(!duplicated(zoo::na.locf0(zoo::na.locf0(.x), 
     fromLast = TRUE))), .names = "{.col}_new"))) %>% 
  fill(everything(), .direction = "downup") %>%
  ungroup %>% 
  select(-grp)

-output

# A tibble: 6 × 3
  Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004    

CodePudding user response:

Update II: For the new dataset df2:

add the .direction argument to fill in both direction the grouping variable and then apply the fill_run funciton:

library(dplyr)
library(runner)

df2 %>% 
  fill(Store1_ID, .direction = "updown") %>% 
  group_by(Store1_ID) %>% 
  mutate(across(everything(), ~fill_run(., run_for_first = TRUE)))
  Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004 

Update to consider groupwise:

library(dplyr)
library(runner)

df %>% 
  fill(Store1_ID) %>% 
  group_by(Store1_ID) %>% 
  mutate(across(everything(), ~fill_run(., run_for_first = TRUE)))
   Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004    
  • Related