Home > Enterprise >  How do you clean a column with mixed height units in r
How do you clean a column with mixed height units in r

Time:12-01

I have a column in my dataframe from a survey that has two different units in it. I need to remove these and convert the info into a consistent unit ie a column of all cm without the unit being present.

Here is some sample data

df <- data.frame(v1 = c('100 cm', '6 foot 10', '200 cm', '5 foot 11')

I attempted to use this readr::parse_number(df$v1) but that would turn '6 foot 10' into 6. I'm not sure it's that helpful anyway because I still need to convert the heights recorded as feet and inches into cm

CodePudding user response:

You can split the strings by foot ('6 foot 10' >> c(6, 10)), and then map through the items, parse them as numbers, and -- if they have length 2 (ie, are the result of a string split) -- multiply them by the conversion rates.

library(tidyverse)
df$v1 %>%
  strsplit(" foot ") %>%
  map_dbl(function(num){
    num = parse_number(num)
    if(length(num) == 2) {sum(num * c(30.48, 2.54))} else {num}})

[1] 100.00 208.28 200.00 180.34

Another option, but less elegant in my opinion, is to change the '6 foot 10' to '6*30.48 2.54*10' (multiply the #feet and #inches with their conversion rates) using stringr::str_replace_all, and then evaluate those strings as expressions with eval parse (with a mapping function like sapply or purrr::map, because these functions are not vectorised).

df$v1 %>%
  str_replace_all(c(" cm" = "", " foot " = "*30.48   2.54*")) %>%
  map_dbl(~ parse(text = .x) %>% eval())

[1] 100.00 208.28 200.00 180.34

CodePudding user response:

Maybe something like this (with tidyverse):

df %>%  mutate(
  units = case_when(
    str_detect(v1, ".cm") ~ "cm",
    str_detect(v1, ".foot") ~ "foot"
  )
)

Then you get a unit column and based on that can apply a conversion factor.

  • Related