Home > database >  Is there a function in r to pick out columns which contain part of a value?
Is there a function in r to pick out columns which contain part of a value?

Time:10-14

I am working on an assignment where one of my columns contains measurements in feet, cm, and m, and I am trying to convert them all to metres. So far I have been able to convert individual cells to simple numeric values (e.g. 5_ft_7 to 5.7), but I cannot find a function that will convert them to metres, especially without modifying the values that are already in centimetres/metres.

My question is, is there a way of targeting JUST the cells that contain 'ft' without specifying each of them individually?

Dataset (hopefully this helps):

> Data_original3$Height
  [1] "5.7"    "157_cm" "5.11"   "167_cm" "1.65_m" "187_cm" "1.71_m" "188_cm" "5.2"   
 [10] "5.5"    "5.7"    "155_cm" "5.4"    "163_cm" "6.4"    "170_cm" "5.7"    "5.8"   
 [19] "186_cm" "5.1"    "5.3"    "5.3"    "5.7"    "5.8"    "6.2"    "175_cm" "5.6"   
 [28] "5.7"    "180_cm" "5.6"    "160_cm" "163_cm" "5.6"    "163_cm" "5.7"    "175_cm"
 [37] "165_cm" "5.7"    "5.6"    "5.11"   "188_cm" "5.6"    "5.3"    "5.5"    "5.4"   
 [46] "5.6"    "180_cm" "5.9"    "165_cm" "5.6"    "180_cm" "165_cm" "175_cm" "5.4"   
 [55] "167_cm" "175_cm" "5.7"    "5.11"   "5.11"   "5.5"    "6.1"    "1.68_m" "5.4"   
 [64] "5.7"    "5.3"    "5.5"    "5.9"    "5.9"    "5.4"    "5.6"    "5.8"    "5.5"   
 [73] "5.9"    "6.3"    "6.1"    "5.8"    "5.2"    "5.2"    "6.0"    "166_cm" "5.3"   
 [82] NA       "166_cm" "1.88_m" "5.6"    "5.10"   "171_cm" "5.1"    "170_cm" "178_cm"
 [91] "5.2"    "185_cm" "5.11"   "5.9"    "5.11"   "5.7"    "6.0"    "6.1"    "176_cm"
[100] "5.7"    "189_cm" "5.3"    "5.7"    "164_cm" "5.6"    "5.8"    NA       NA      
[109] "175_cm" "157_cm" "5.10"   "172_cm" "170_cm" "5.7"    "5.8"    "5.6"    "169_cm"
[118] "6.2"    "6.4"    "1.71_m" "5.10"   "1.67_m" "5.2"    "160_cm" "5.8"    "6.2"   
[127] "5.5"    "180_cm" "175_cm" "5.0"    "195_cm" "5.5"    "6.0"    "175_cm"

Thank you

CodePudding user response:

Try this, where my vec is your $Height column:

U <- gsub("[0-9._]", "", vec)
head(U)
# [1] ""   "cm" ""   "cm" "m"  "cm"

as.numeric(gsub("[^0-9.]", "", vec)) *
  sapply(U, switch, m = 1, cm = 1/100, 1)
#        cm        cm    m   cm    m   cm                  cm        cm        cm             cm                          
# 5.70 1.57 5.11 1.67 1.65 1.87 1.71 1.88 5.20 5.50 5.70 1.55 5.40 1.63 6.40 1.70 5.70 5.80 1.86 5.10 5.30 5.30 5.70 5.80 
#        cm             cm        cm   cm        cm        cm   cm                  cm                            cm      
# 6.20 1.75 5.60 5.70 1.80 5.60 1.60 1.63 5.60 1.63 5.70 1.75 1.65 5.70 5.60 5.11 1.88 5.60 5.30 5.50 5.40 5.60 1.80 5.90 
#   cm        cm   cm   cm        cm   cm                             m                                                   
# 1.65 5.60 1.80 1.65 1.75 5.40 1.67 1.75 5.70 5.11 5.11 5.50 6.10 1.68 5.40 5.70 5.30 5.50 5.90 5.90 5.40 5.60 5.80 5.50 
#                                      cm      <NA>   cm    m             cm        cm   cm        cm                     
# 5.90 6.30 6.10 5.80 5.20 5.20 6.00 1.66 5.30   NA 1.66 1.88 5.60 5.10 1.71 5.10 1.70 1.78 5.20 1.85 5.11 5.90 5.11 5.70 
#             cm        cm             cm           <NA> <NA>   cm   cm        cm   cm                  cm              m 
# 6.00 6.10 1.76 5.70 1.89 5.30 5.70 1.64 5.60 5.80   NA   NA 1.75 1.57 5.10 1.72 1.70 5.70 5.80 5.60 1.69 6.20 6.40 1.71 
#         m        cm                  cm   cm        cm             cm 
# 5.10 1.67 5.20 1.60 5.80 6.20 5.50 1.80 1.75 5.00 1.95 5.50 6.00 1.75 

You can ignore the names, it's the values that are important.

unname(as.numeric(gsub("[^0-9.]", "", vec)) *
  sapply(U, switch, m = 1, cm = 1/100, 1))
#   [1] 5.70 1.57 5.11 1.67 1.65 1.87 1.71 1.88 5.20 5.50 5.70 1.55 5.40 1.63 6.40 1.70 5.70 5.80 1.86 5.10 5.30 5.30 5.70
#  [24] 5.80 6.20 1.75 5.60 5.70 1.80 5.60 1.60 1.63 5.60 1.63 5.70 1.75 1.65 5.70 5.60 5.11 1.88 5.60 5.30 5.50 5.40 5.60
#  [47] 1.80 5.90 1.65 5.60 1.80 1.65 1.75 5.40 1.67 1.75 5.70 5.11 5.11 5.50 6.10 1.68 5.40 5.70 5.30 5.50 5.90 5.90 5.40
#  [70] 5.60 5.80 5.50 5.90 6.30 6.10 5.80 5.20 5.20 6.00 1.66 5.30   NA 1.66 1.88 5.60 5.10 1.71 5.10 1.70 1.78 5.20 1.85
#  [93] 5.11 5.90 5.11 5.70 6.00 6.10 1.76 5.70 1.89 5.30 5.70 1.64 5.60 5.80   NA   NA 1.75 1.57 5.10 1.72 1.70 5.70 5.80
# [116] 5.60 1.69 6.20 6.40 1.71 5.10 1.67 5.20 1.60 5.80 6.20 5.50 1.80 1.75 5.00 1.95 5.50 6.00 1.75

Walk-through:

  • as.numeric(gsub("[^0-9.]", "", vec)) extracts just the number components
  • U is the units extracted from each, where empty strings "" means there was no unit applied.
  • switch(U[1], m = 1, cm = 1/100, 1) would check the first U unit and return a conversion into meters; the trailing unnamed 1 is the default assigned if U[1] is not one of the known strings "cm" and "m", which we'll use as 1 (feet).
  • because switch is not vectorized, I use sapply(U, switch, ...) to vectorize its effect, and it returns a vector of multipliers to apply to the numbers extracted with as.numeric(.)

Data

vec <- c("5.7", "157_cm", "5.11", "167_cm", "1.65_m", "187_cm", "1.71_m", "188_cm", "5.2", "5.5", "5.7", "155_cm", "5.4", "163_cm", "6.4", "170_cm", "5.7", "5.8", "186_cm", "5.1", "5.3", "5.3", "5.7", "5.8", "6.2", "175_cm", "5.6", "5.7", "180_cm", "5.6", "160_cm", "163_cm", "5.6", "163_cm", "5.7", "175_cm", "165_cm", "5.7", "5.6", "5.11", "188_cm", "5.6", "5.3", "5.5", "5.4", "5.6", "180_cm", "5.9", "165_cm", "5.6", "180_cm", "165_cm", "175_cm", "5.4", "167_cm", "175_cm", "5.7", "5.11", "5.11", "5.5", "6.1", "1.68_m", "5.4", "5.7", "5.3", "5.5", "5.9", "5.9", "5.4", "5.6", "5.8", "5.5", "5.9", "6.3", "6.1", "5.8", "5.2", "5.2", "6.0", "166_cm", "5.3", NA, "166_cm", "1.88_m", "5.6", "5.10", "171_cm", "5.1", "170_cm", "178_cm", "5.2", "185_cm", "5.11", "5.9", "5.11", "5.7", "6.0", "6.1", "176_cm", "5.7", "189_cm", "5.3", "5.7", "164_cm", "5.6", "5.8", NA, NA, "175_cm", "157_cm", "5.10", "172_cm", "170_cm", "5.7", "5.8", "5.6", "169_cm", "6.2", "6.4", "1.71_m", "5.10", "1.67_m", "5.2", "160_cm", "5.8", "6.2", "5.5", "180_cm", "175_cm", "5.0", "195_cm", "5.5", "6.0", "175_cm")

CodePudding user response:

You can just select the rows that contain ft or in the Data_Original$Height data you showed, select the rows where there is no _ and change only these rows.

Data_Original$Height[grepl(pattern = "_", Data_Original$Height) == F] <- round(as.numeric(Data_Original$Height[grepl(pattern = "_", Data_Original$Height) == F])*0.3048, 2)

And if you want to use the data with the 'ft' still labelled

Data_Original$Height[grepl(pattern = "ft", Data_Original$Height)] <- round(as.numeric(gsub("[^0-9.]", "", Data_Original$Height[grepl(pattern = "ft", Data_Original$Height)]))*0.3048, 2)

CodePudding user response:

Here is the tidyverse way:

Data_original3 %>%
  separate(Height, c('feet', 'inches'), "_ft_", convert = TRUE, remove = FALSE) %>%
  mutate(Height = if_else(grepl("ft", Height), paste0(round((12*as.numeric(feet)   inches)*2.54/100, digits = 2), "_m"), paste0(Height))) %>%
  select(Height)

Using separate, I split feet and inches into two columns, then perform the conversion calculation back into the original "Height" column.

Note: This only works if your still splitting by "_ft_" instead of "."

  •  Tags:  
  • r
  • Related