What I need to do is the following: I have a dataframe with a q1_monetary_unit
and value
column like this, let's call it df
:
ID q1_monetary_unit value
1 -999 1000
2 -999 2000
3 10 US dollars -888
4 21 euros -888
Whenever value isn't answered on my country's currency, there's a string that describes the value and the foreign currency. I've also got a dataframe that has the value of each currency in my country's currency like this one:
US dollar Euro
780 850
So my end goal is to turn "10 dollars" into 10 * 780 and 21 euros into 21 * 850, and put those values in the value
column, ending like this:
ID value
1 1000
2 2000
3 7800
4 17850
So, working on my original data, after some transformations I managed to get the following dataframe, let's call it df2
:
ID value
3 7800
4 17850
So, I want to know what's the easiest way to replace value
in my original dataframe when ID
corresponds with the IDs in this new dataframe. Tried doing something like
df %>% mutate(value = case_when(id %in% df2$id~ df2$value,
T ~ value))
But it doesn't work. I could delete the rows from df
and the bind the rows from df2
but that doesn't sound very efficient. In Python I would use the map()
function, but I'm not sure if R's map
works the same way. Any help would be appreciated.
CodePudding user response:
I skipped your df2 phase:
df <- dplyr::tibble(
ID = 1:4,
q1_monetary_unit = c(-999, -999, "10 US dollars", "21 euros"),
value = c(1000, 2000, -888, -888)
)
conversions <- dplyr::tibble(
`US dollars` = 780,
euros = 850 # NB not exact match in you example!
)
# make long for join
conversions_long <-
conversions %>% tidyr::pivot_longer(
everything(), names_to = "currency", values_to = "conversion_factor"
)
df %>%
tidyr::extract(
q1_monetary_unit,
into = c("val", "currency"),
regex = "(\\d ) (.*)", # split the value from the currency name
remove = TRUE, # drop the q1_monetary_unit column
convert = TRUE # convert numeric type automatically
) %>%
dplyr::full_join(
conversions_long, by = "currency" # join by currency
) %>%
mutate(
val = as.double(val), # make val a double as defaulted to int and if_else enforces same type
value = if_else(is.na(val), value, val), # merge val and value to fill missing data
conversion_factor = if_else(is.na(conversion_factor), 1, conversion_factor),
# multiple by 1 if currency it not specified
value = value * conversion_factor
) %>%
dplyr::select(-val) # drop the val column as no longer needed
I didn't know about the recode function though that looks useful!
CodePudding user response:
Managed to do it:
turned df2
into a list and then recoded using it:
value_list = as.list(df2$value)
names(value_list ) = df2$id
df = df%>% mutate(value= recode(df$id, !!!value_list , .default=value))
Any better alternatives are also appreciated
CodePudding user response:
A single regex does the two hard pieces: (1) pull out the currency into a new column (called currency
) with rematch2 and (2) strip the currency label so value_1
can be cast to a number with sub()
.
Inside the parentheses of pattern_currencies
, the meat of the currency is identified & extracted. Junk is stripped out outside the parentheses, such as the "s" and the end of "Euro" or "US dollar" (ie, s?
) and any surrounding white spaces (ie, \\s*
).
If only a few currencies are involved, and they're consistently presented, this shouldn't be hard to maintain over time. If they're not consistent, you'll probably need a second layer of transformation (eg, "USD" and "US Dollar" are both converted to "US dollar"). Remember the join is case-sensitive
# Incoming dataset & exchange rates
ds <-
tibble::tribble(
~ID, ~q1 , ~value_1,
1, "-999" , 1000,
2, "-999" , 2000,
3, "10 US dollars" , -888,
4, "21 Euros" , -888
)
ds_exchange <-
tibble::tribble(
~currency , ~rate,
"US dollar", 780,
"Euro" , 850
)
# Extract & Convert
pattern_currencies <- "\\s*(?<currency>US dollar|Euro)s?\\s*"
ds |>
rematch2::bind_re_match(from = q1, pattern_currencies) |>
dplyr::left_join(ds_exchange, by = "currency") |>
dplyr::mutate(
q1 = dplyr::na_if(q1, "-999"), # Set junk to NA
# value_1 = dplyr::na_if(value_1, "-888"), # Consider setting to NA
rate = dplyr::coalesce(rate, 1), # Fill in missing/unnecessary rates
value_2 = sub(pattern_currencies, "", q1, perl = T), # Strip currency label
value_2 = as.numeric(value_2), # Cast to a number
value_2 = dplyr::coalesce(value_2, value_1),
value_3 = value_2 * rate # Multiple by exchange rate
)
Output:
ID q1 value_1 currency rate value_2 value_3
1 1 <NA> 1000 <NA> 1 1000 1000
2 2 <NA> 2000 <NA> 1 2000 2000
3 3 10 US dollars -888 US dollar 780 10 7800
4 4 21 Euros -888 Euro 850 21 17850