Home > OS >  R: if column that starts with a value is all na, then print na
R: if column that starts with a value is all na, then print na

Time:12-29

I have a column primary in a dataframe with already with set values. I'm trying to write a code where if all columns that start with "dx" are NA, then the NA, otherwise, print the original value.

To note, this is only a segment of the dataframe, there are many other columns

My current dataframe

#    dx1  dx2 dx3 dx4 dx5     primary
# 1 I629 <NA>  NA  NA  NA Unspecified
# 2 S065 <NA>  NA  NA  NA        S065
# 3 I629 S066  NA  NA  NA        I629
# 4 I629 I629  NA  NA  NA Unspecified
# 5 NA   NA    NA  NA  NA Unspecified

Desired output:

#    dx1  dx2 dx3 dx4 dx5     primary
# 1 I629 <NA>  NA  NA  NA Unspecified
# 2 S065 <NA>  NA  NA  NA        S065
# 3 I629 S066  NA  NA  NA        I629
# 4 I629 I629  NA  NA  NA Unspecified
# 5 NA   NA    NA  NA  NA NA

CodePudding user response:

With dplyr

library(tidyverse) 

df %>%
  mutate(primary = case_when(
    if_all(starts_with("dx"), is.na) ~ NA_character_,
                                   T ~ primary
    ))

# A tibble: 5 × 6
  dx1   dx2   dx3   dx4   dx5   primary    
  <chr> <chr> <lgl> <lgl> <lgl> <chr>      
1 I629  <NA>  NA    NA    NA    Unspecified
2 S065  <NA>  NA    NA    NA    S065       
3 I629  S066  NA    NA    NA    I629       
4 I629  I629  NA    NA    NA    Unspecified
5 NA    NA    NA    NA    NA    NA     

CodePudding user response:

You can do a nested ifelse statement:

df$primary <- apply(df, 1, function(x) ifelse(all(is.na(x)), NA, 
                                                  ifelse(sum(x == "I629" | is.na(x)) == 5, "Unspecified", x[1])))

Output:

#   dx1  dx2 dx3 dx4 dx5     primary
# 1 I629 <NA>  NA  NA  NA Unspecified
# 2 S065 <NA>  NA  NA  NA        S065
# 3 I629 S066  NA  NA  NA        I629
# 4 I629 I629  NA  NA  NA Unspecified
# 5 <NA> <NA>  NA  NA  NA        <NA>

Data

df <- read.table(text = "dx1   dx2   dx3   dx4   dx5
I629 NA    NA    NA    NA
S065 NA    NA    NA    NA
I629 S066  NA    NA    NA
I629 I629 NA   NA    NA
NA NA NA NA NA", header = TRUE)

More complex

Since you mentioned that there may be more columns, here is an example you only want to consider values from columns with "dx" in the name:

# Identify relevant columns
rel_cols <- grep("dx", names(df))

df$primary <- apply(df[,rel_cols], 1, function(x) ifelse(all(is.na(x)), NA, 
                                                  ifelse(sum(x == "I629" | is.na(x)) == 5, 
                                                         "Unspecified", x[1])))
## Output

#   dx1  dx2 dx3 dx4 dx5  abc     primary
# 1 I629 <NA>  NA  NA  NA <NA> Unspecified
# 2 S065 <NA>  NA  NA  NA  XXX        S065
# 3 I629 S066  NA  NA  NA <NA>        I629
# 4 I629 I629  NA  NA  NA  YYY Unspecified
# 5 <NA> <NA>  NA  NA  NA  ZZZ        <NA>

## Data with "abc" column added
df <- read.table(text = "dx1   dx2   dx3   dx4   dx5 abc
I629 NA    NA    NA    NA NA
S065 NA    NA    NA    NA XXX
I629 S066  NA    NA    NA NA
I629 I629 NA   NA    NA YYY
NA NA NA NA NA ZZZ", header = TRUE)
  • Related