Home > OS >  Best way to map values using some ID using dplyr?
Best way to map values using some ID using dplyr?

Time:11-25

What I need to do is the following: I have a dataframe with a q1_monetary_unit and value column like this, let's call it df:

ID     q1_monetary_unit   value
1          -999           1000
2          -999           2000
3        10 US dollars    -888
4        21 euros         -888

Whenever value isn't answered on my country's currency, there's a string that describes the value and the foreign currency. I've also got a dataframe that has the value of each currency in my country's currency like this one:

US dollar    Euro
   780        850

So my end goal is to turn "10 dollars" into 10 * 780 and 21 euros into 21 * 850, and put those values in the value column, ending like this:

ID     value
1        1000
2        2000
3        7800
4        17850

So, working on my original data, after some transformations I managed to get the following dataframe, let's call it df2:

ID    value
3     7800
4    17850

So, I want to know what's the easiest way to replace value in my original dataframe when ID corresponds with the IDs in this new dataframe. Tried doing something like

df %>% mutate(value = case_when(id %in% df2$id~ df2$value,
                                                   T ~ value))

But it doesn't work. I could delete the rows from df and the bind the rows from df2 but that doesn't sound very efficient. In Python I would use the map() function, but I'm not sure if R's map works the same way. Any help would be appreciated.

CodePudding user response:

I skipped your df2 phase:

df <- dplyr::tibble(
    ID = 1:4,
    q1_monetary_unit = c(-999, -999, "10 US dollars", "21 euros"), 
    value = c(1000, 2000, -888, -888)
)

conversions <- dplyr::tibble(
    `US dollars` = 780,
    euros = 850 # NB not exact match in you example!
)

# make long for join
conversions_long <- 
conversions %>% tidyr::pivot_longer(
    everything(), names_to = "currency", values_to = "conversion_factor"
)

df %>%
    tidyr::extract(
        q1_monetary_unit,
        into = c("val", "currency"),
        regex = "(\\d ) (.*)", # split the value from the currency name
        remove = TRUE, # drop the q1_monetary_unit column
        convert = TRUE # convert numeric type automatically
    ) %>%
    dplyr::full_join(
        conversions_long, by = "currency" # join by currency
    ) %>%
    mutate(
        val = as.double(val), # make val a double as defaulted to int and if_else enforces same type
        value = if_else(is.na(val), value, val), # merge val and value to fill missing data
        conversion_factor = if_else(is.na(conversion_factor), 1, conversion_factor),
        # multiple by 1 if currency it not specified
        value = value * conversion_factor
    ) %>%
    dplyr::select(-val) # drop the val column as no longer needed

I didn't know about the recode function though that looks useful!

CodePudding user response:

Managed to do it:

turned df2 into a list and then recoded using it:

value_list = as.list(df2$value)
names(value_list ) = df2$id
df = df%>% mutate(value= recode(df$id, !!!value_list , .default=value))

Any better alternatives are also appreciated

CodePudding user response:

A single regex does the two hard pieces: (1) pull out the currency into a new column (called currency) with rematch2 and (2) strip the currency label so value_1 can be cast to a number with sub().

Inside the parentheses of pattern_currencies, the meat of the currency is identified & extracted. Junk is stripped out outside the parentheses, such as the "s" and the end of "Euro" or "US dollar" (ie, s?) and any surrounding white spaces (ie, \\s*).

If only a few currencies are involved, and they're consistently presented, this shouldn't be hard to maintain over time. If they're not consistent, you'll probably need a second layer of transformation (eg, "USD" and "US Dollar" are both converted to "US dollar"). Remember the join is case-sensitive

# Incoming dataset & exchange rates
ds <- 
  tibble::tribble(
    ~ID, ~q1               , ~value_1,
    1,   "-999"            ,  1000,
    2,   "-999"            ,  2000,
    3,   "10 US dollars"   ,  -888,
    4,   "21 Euros"        ,  -888
  )
ds_exchange <- 
  tibble::tribble(
    ~currency  , ~rate,
    "US dollar",  780,
    "Euro"     ,  850
  )

# Extract & Convert
pattern_currencies <- "\\s*(?<currency>US dollar|Euro)s?\\s*"
ds |> 
  rematch2::bind_re_match(from = q1, pattern_currencies) |> 
  dplyr::left_join(ds_exchange, by = "currency") |> 
  dplyr::mutate(
    q1      = dplyr::na_if(q1, "-999"),                   # Set junk to NA
    # value_1 = dplyr::na_if(value_1, "-888"),            # Consider setting to NA
    rate    = dplyr::coalesce(rate, 1),                   # Fill in missing/unnecessary rates
    
    value_2 = sub(pattern_currencies, "", q1, perl = T),  # Strip currency label
    value_2 = as.numeric(value_2),                        # Cast to a number
    value_2 = dplyr::coalesce(value_2, value_1),
    value_3 = value_2 * rate                              # Multiple by exchange rate
  )

Output:

  ID            q1 value_1  currency rate value_2 value_3
1  1          <NA>    1000      <NA>    1    1000    1000
2  2          <NA>    2000      <NA>    1    2000    2000
3  3 10 US dollars    -888 US dollar  780      10    7800
4  4      21 Euros    -888      Euro  850      21   17850
  • Related