I have this dataset:
sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA", "125 ocean road") )
col1 col2 col3 col4
james 123 forest road NA NA
john jason 124 valley street NA
henry tim peter 125 ocean road
I want to try and figure out a way in which the second column always contains the "address" - the final product would look this would look something like this:
# code to show sample of desired result
desired_result = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "124 valley street", "125 ocean road"))
col1 col2
james 123 forest road
john 124 valley street
henry 125 ocean road
I have been trying to think of and research functions in R that are able to "scan" if the value contained within a row/column starts with a number, and make a decision accordingly.
I had the following idea - I can check to see if a given column starts with a number or not:
sample_data$is_col2_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col2,1,1))
sample_data$is_col3_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col3,1,1))
sample_data$is_col4_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col4,1,1))
col1 col2 col3 col4 is_col2_a_number is_col3_a_number is_col4_a_number
1 james 123 forest road NA NA TRUE FALSE FALSE
2 john jason 124 valley street NA FALSE TRUE FALSE
3 henry tim peter 125 ocean road FALSE FALSE TRUE
Next, I would try to figure out how to code the following logic:
- For a given row, find the first cell that contains the value TRUE
- Keep the column corresponding to that condition
I tried this row-by-row:
first_row = sample_data[1,]
ifelse(first_row$is_col2_a_number == "TRUE", first_row[,c(1,2)], ifelse(first_row$is_col3_a_number, first_row[, c(1,3)], first_row[, c(1,4)]))
But I think I have made this unnecessarily complicated. Can someone please give me a hand and suggest how I can continue solving this problem?
Thank you!
CodePudding user response:
This should work:
library(dplyr)
library(tidyr)
library(stringr)
sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA", "125 ocean road") )
tmp <- sample_data %>%
mutate(across(col2:col4, ~case_when(str_detect(.x, "^\\d") ~ .x,
TRUE ~ NA_character_)),
address = coalesce(col2, col3, col4)) %>%
select(col1, address)
tmp
#> col1 address
#> 1 james 123 forest road
#> 2 john 124 valley street
#> 3 henry 125 ocean road
Created on 2022-06-30 by the reprex package (v2.0.1)
CodePudding user response:
I thought of a (very ineffective) way to solve my own problem!
sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA", "125 ocean road") )
sample_data$is_col2_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col2,1,1))
sample_data$is_col3_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col3,1,1))
sample_data$is_col4_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col4,1,1))
a1 <- sample_data[which(sample_data$is_col2_a_number == "TRUE"), ]
a1 <- a1[,c(1,2)]
colnames(a1)[2] <- "i"
b1 <- sample_data[which(sample_data$is_col3_a_number == "TRUE"), ]
b1 <- b1[,c(1,3)]
colnames(b1)[2] <- "i"
c1 <- sample_data[which(sample_data$is_col4_a_number == "TRUE"), ]
c1 <- c1[,c(1,4)]
colnames(c1)[2] <- "i"
final = rbind(a1,b1,c1)
Here is the desired output:
col1 i
1 james 123 forest road
2 john 124 valley street
3 henry 125 ocean road