Places after decimal points discarded when extracting numbers from strings

Time:11-23

I'd like to extract weight values from strings that also contain the unit and the date of measurement, using the tidyverse.

My dataset looks like this:

df <- tibble(ID = c("A","B","C"), 
             Measurement = c("45kg^20221120", "11.5kg^20221015", "66.05kg^20221020"))

------
A tibble: 3 × 2
  ID    Measurement     
  <chr> <chr>           
1 A     45kg^20221120   
2 B     11.5kg^20221015 
3 C     66.05kg^20221020

I used stringr (part of the tidyverse) with a regular expression:

library(tidyverse)
df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))

----------
A tibble: 3 × 3
  ID    Measurement      Weight
  <chr> <chr>             <dbl>
1 A     45kg^20221120      45  
2 B     11.5kg^20221015    11.5
3 C     66.05kg^20221020   66.0

The second decimal place of C (.05) isn't extracted. What's wrong with my code? Any answers or comments are welcome.

Thanks.

CodePudding user response:

Yes, it was extracted. The tibble print method shows only three significant digits by default, so 66.05 is displayed as 66.0.

You can see the full value if you convert the tibble to a data.frame or open it with View().
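The check described above can be sketched as follows. This is a minimal example, assuming the input column is named Measurement (as in the question's printed output) and that the regular expression uses `+` quantifiers; the object name `res` is only illustrative.

```r
library(tibble)
library(stringr)

df <- tibble(ID = c("A", "B", "C"),
             Measurement = c("45kg^20221120", "11.5kg^20221015", "66.05kg^20221020"))

# Same extraction as in the question, stored in a new Weight column
res <- df
res$Weight <- as.numeric(str_extract(res$Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))

# Pulling the column out as a plain vector bypasses the tibble column formatting
print(res$Weight)

# Converting to a base data.frame uses base R printing instead
print(as.data.frame(res))

# Comparing against the expected number confirms what is actually stored
isTRUE(all.equal(res$Weight[3], 66.05))
```

Only the printed representation differs between the tibble and the data.frame; the stored numbers are the same in both.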


CodePudding user response:

You could try to pull all the data out of the string at once with extract:

library(tidyverse)

df <- tibble(ID = c("A","B","C"), 
             Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

df |>
  extract(col = Weight, 
          into = c("weight", "unit", "date"),
          regex = "(.*)(kg)\\^(.*$)", 
          remove = TRUE, 
          convert = TRUE) |>
  mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#>   ID    weight unit  date      
#>   <chr>  <dbl> <chr> <date>    
#> 1 A       45   kg    2022-11-20
#> 2 B       51.5 kg    2022-10-15
#> 3 C       66.0 kg    2022-10-20

Note that, as stated in the comments, the .05 is just not printing, but is present in the data.
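If you would rather keep the tibble and see more digits, one option is to raise the `pillar.sigfig` option, which controls how many significant digits tibble columns are printed with. The sketch below reuses the data from this answer; the name `out` and the value 5 are arbitrary choices for illustration.

```r
library(tibble)
library(tidyr)

df <- tibble(ID = c("A", "B", "C"),
             Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

out <- df |>
  extract(col = Weight,
          into = c("weight", "unit", "date"),
          regex = "(.*)(kg)\\^(.*$)",
          convert = TRUE)

# Temporarily print tibble columns with more significant digits
old <- options(pillar.sigfig = 5)
print(out)
options(old)  # restore the previous setting

# The underlying vector holds the full value either way
print(out$weight)
```

Changing the option affects only how the tibble is displayed, not the values it holds.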
