I'd like to extract weight values from strings that also contain the unit and the measurement date, using tidyverse.
My dataset looks like this:
df <- tibble(ID = c("A","B","C"),
             Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
------
# A tibble: 3 × 2
  ID    Measurement
  <chr> <chr>
1 A     45kg^20221120
2 B     51.5kg^20221015
3 C     66.05kg^20221020
I use stringr (part of the tidyverse) with a regular expression:
library(tidyverse)
df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))
----------
# A tibble: 3 × 3
  ID    Measurement      Weight
  <chr> <chr>             <dbl>
1 A     45kg^20221120      45
2 B     51.5kg^20221015    51.5
3 C     66.05kg^20221020   66.0
The second decimal place of C (.05) isn't extracted. What's wrong with my code? Any answers or comments are welcome.
Thanks.
CodePudding user response:
Yes, it was extracted; the tibble is just rounding it to 66.0 for display.
You can see the full value if you convert the result to a data.frame or open it with View().
Solution
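As a minimal sketch of the point above (the column name Measurement and the object name res are assumptions chosen for illustration), converting the result to a plain data.frame or looking at the column vector directly shows the stored value:

```r
library(dplyr)
library(stringr)
library(tibble)

# Example data, assuming the string column is called Measurement
df <- tibble(ID = c("A", "B", "C"),
             Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

res <- df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))

# A plain data.frame is printed with base R's default of 7 significant digits,
# so the second decimal place is visible
as.data.frame(res)

# The column vector itself also holds the full value
res$Weight
```

Only the printed representation differs between the tibble and the data.frame; the numbers stored in Weight are the same.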
CodePudding user response:
You could try to pull all the data out of the string at once with extract:
library(tidyverse)
df <- tibble(ID = c("A","B","C"),
             Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
df |>
  extract(col = Weight,
          into = c("weight", "unit", "date"),
          regex = "(.*)(kg)\\^(.*$)",
          remove = TRUE,
          convert = TRUE) |>
  mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#>   ID    weight unit  date
#>   <chr>  <dbl> <chr> <date>
#> 1 A       45   kg    2022-11-20
#> 2 B       51.5 kg    2022-10-15
#> 3 C       66.0 kg    2022-10-20
Note that, as stated in the comments, the .05 is just not printing, but is present in the data.
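If you would rather keep the tibble and just see more digits, one option is the pillar.sigfig option, which controls how many significant digits tibbles print (3 by default). This is only a sketch; the object name res and the choice of 4 digits are assumptions for illustration:

```r
library(tibble)

# Stand-in for the extracted result, with the weights already numeric
res <- tibble(ID = c("A", "B", "C"), weight = c(45, 51.5, 66.05))

# Default printing uses 3 significant digits
print(res)

# Temporarily ask for 4 significant digits, then restore the old setting
old <- options(pillar.sigfig = 4)
print(res)
options(old)
```

This changes the display only; the stored values are untouched either way.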