I was trying to transform some datasets in R when I ran into the following issue: I have a character column that holds the income of some people (census data), and I want to standardize it for future analysis. This is a sample of the data:
income |
---|
2000,3 Thousand Euros |
50,14 Thousand Euros |
54000 Euros |
This is what I am expecting:
income |
---|
2000.3 k€ |
50.14 k€ |
54 k€ |
And finally, this is the code I have so far, but it is still not working. I am new to R and I am still searching for methods. To clarify, in the if statement I was trying to find all the values that have more than 4 digits, but I think it would be easier to search for the ones that contain " Euros". However, to do arithmetic I believe I have to convert the character column into an integer one, and then the " Euros" regex would no longer apply (I believe).
census$income <- str_replace_all(census$income, " Thousand Euros", '')
census$income <- str_replace_all(census$income, " Euros", '')
census$income <- as.integer(census$income)
if(floor(log10(census$income)) + 1 > 4){
census$income/1000
}
census$income <- as.character(census$income)
Thank you very much for any help! =)
CodePudding user response:
I think you can accomplish this with a combination of readr::parse_number and str_detect(tolower(income), "thousand").
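library(dplyr)
library(stringr)
library(readr)

census %>%
  mutate(
    parsed_income = if_else(
      str_detect(tolower(income), "thousand"),
      # already in thousands; the sample data uses a comma as decimal mark
      parse_number(income, locale = locale(decimal_mark = ",")),
      # plain euros: divide by 1000 to get thousands
      parse_number(income, locale = locale(decimal_mark = ",")) / 1000
    )
  )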
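To get from a numeric column back to the label format asked for in the question, the unit can be pasted onto the numbers. This is only a minimal sketch: it assumes the parsed values are already expressed in thousands of euros, and the example vector and the name income_label are hypothetical, chosen to mirror the sample data.

```r
# Hypothetical numeric incomes, already expressed in thousands of euros
parsed_income <- c(2000.3, 50.14, 54)

# paste() converts each number to text and appends the unit after a space
income_label <- paste(parsed_income, "k€")
income_label
```

For the three sample values this should give "2000.3 k€", "50.14 k€" and "54 k€".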
CodePudding user response:
A solution with nested sub:
dplyr:
library(dplyr)
df %>%
mutate(income = sub("(000\\s|\\sThousand\\s)?Euros", " k€",
sub(",", ".", income)))
income
1 2000.3 k€
2 50.14 k€
3 54 k€
base R:
df$income <- sub("(000\\s|\\sThousand\\s)?Euros", " k€",
sub(",", ".", df$income))
Data:
df <- data.frame(
income = c("2000,3 Thousand Euros","50,14 Thousand Euros","54000 Euros")
)
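Note that the sub pattern relies on plain-euro amounts ending in 000, as 54000 does. If that is not guaranteed, a numeric variant in base R might look like the sketch below. It assumes every value is labelled either "Thousand Euros" or plain "Euros" and uses a comma as the decimal mark; to_keur is a hypothetical helper name, and 54500 is an extra illustrative value that is not in the original data.

```r
# Hypothetical helper: convert income strings to "<number> k€" labels
to_keur <- function(x) {
  # TRUE where the value is already given in thousands
  is_thousand <- grepl("thousand", x, ignore.case = TRUE)
  # keep only digits and the decimal comma, then turn the comma into a dot
  value <- as.numeric(sub(",", ".", gsub("[^0-9,]", "", x), fixed = TRUE))
  # plain euro amounts are divided by 1000 to express them in thousands
  value <- ifelse(is_thousand, value, value / 1000)
  paste(value, "k€")
}

incomes <- c("2000,3 Thousand Euros", "50,14 Thousand Euros", "54500 Euros")
to_keur(incomes)
```

With these inputs the last value should come out as "54.5 k€", which the purely textual replacement would not produce.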