R, selecting columns with a specific string, and then replacing it with another string

Currently, part of my R assignment is to take a set of time zones and add a new column that shows each one's location. For example, if Timezone = "GMT 1:00 Europe/Andorra", then Timezone_Continent = "Europe". How can I select the entries that contain keywords like "Europe" or "America" in a way that I can use them with gsub?

df["Timezone_Continent"] <- df$Timezone

ame <- df$Timezone[grep("America", df$Timezone)]

df["Timezone_Continent"] <- gsub(ame, "America", df["Timezone_Continent"])

This is what I currently have, and I know that the second line wouldn't work for gsub. I was sort of just experimenting.

CodePudding user response:

Another option:

library(tidyverse)

df |>
  mutate(TZ = str_extract(Timezone, "(?<=\\d\\s).*?(?=\\/)"))
#> # A tibble: 10 x 2
#>    Timezone                      TZ     
#>    <chr>                         <chr>  
#>  1 GMT-08:00 America/Los_Angeles America
#>  2 GMT 1:00 Europe/Andorra       Europe 
#>  3 GMT-08:00 America/Los_Angeles America
#>  4 GMT-08:00 America/Los_Angeles America
#>  5 GMT 1:00 Europe/Andorra       Europe 
#>  6 GMT-08:00 America/Los_Angeles America
#>  7 GMT-08:00 America/Los_Angeles America
#>  8 GMT-08:00 America/Los_Angeles America
#>  9 GMT 1:00 Europe/Andorra       Europe 
#> 10 GMT 1:00 Europe/Andorra       Europe
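
To unpack the pattern: `(?<=\\d\\s)` is a lookbehind requiring a digit followed by whitespace immediately before the match, `.*?` lazily takes as few characters as possible, and `(?=\\/)` is a lookahead requiring a slash right after the match. If you would rather avoid loading the tidyverse, a rough base R sketch of the same extraction might look like the following. The `tz` vector is made up for illustration and only assumes strings shaped like the ones in the question.

```r
# Illustrative input (an assumption, not the asker's actual data)
tz <- c("GMT-08:00 America/Los_Angeles", "GMT 1:00 Europe/Andorra")

# perl = TRUE is required for lookbehind/lookahead in base R regex
m <- regexpr("(?<=\\d\\s).*?(?=/)", tz, perl = TRUE)

# regmatches() returns only the matched substrings
continent <- regmatches(tz, m)
continent
```

Note that `regmatches()` silently drops elements with no match, so the result can be shorter than the input, whereas `str_extract()` returns `NA` for those elements.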

CodePudding user response:

First define your lookup values

lookup <- c("America", "Europe")

Then create a new column from the lookup:

df$TZ <- apply(sapply(lookup, \(x) grepl(x, df$Timezone)), 1, \(x) lookup[x])

df
#>                         Timezone      TZ
#> 1  GMT-08:00 America/Los_Angeles America
#> 2        GMT 1:00 Europe/Andorra  Europe
#> 3  GMT-08:00 America/Los_Angeles America
#> 4  GMT-08:00 America/Los_Angeles America
#> 5        GMT 1:00 Europe/Andorra  Europe
#> 6  GMT-08:00 America/Los_Angeles America
#> 7  GMT-08:00 America/Los_Angeles America
#> 8  GMT-08:00 America/Los_Angeles America
#> 9        GMT 1:00 Europe/Andorra  Europe
#> 10       GMT 1:00 Europe/Andorra  Europe

Created on 2022-10-23 with reprex v2.0.2
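
It may help to see the intermediate step of the one-liner. The sketch below splits it in two on a small made-up vector (`tz` and `hits` are illustrative names, not part of the original data): `sapply()` builds a logical matrix with one column per keyword, and `apply()` then picks, row by row, the keyword whose column is `TRUE`.

```r
lookup <- c("America", "Europe")

# Illustrative input (an assumption, shaped like the question's strings)
tz <- c("GMT-08:00 America/Los_Angeles", "GMT 1:00 Europe/Andorra")

# One column per keyword, one row per time zone string
hits <- sapply(lookup, function(x) grepl(x, tz))
hits

# For each row, keep the keyword whose column is TRUE
continent <- apply(hits, 1, function(x) lookup[x])
continent
```

This relies on every row matching exactly one keyword. If some row matched none or more than one, the per-row results would differ in length and `apply()` would return a list instead of a character vector.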

Data used

set.seed(1)

df <- data.frame(Timezone = sample(c("GMT-08:00 America/Los_Angeles",
                                     "GMT 1:00 Europe/Andorra"), 10, TRUE))
df
#>                         Timezone
#> 1  GMT-08:00 America/Los_Angeles
#> 2        GMT 1:00 Europe/Andorra
#> 3  GMT-08:00 America/Los_Angeles
#> 4  GMT-08:00 America/Los_Angeles
#> 5        GMT 1:00 Europe/Andorra
#> 6  GMT-08:00 America/Los_Angeles
#> 7  GMT-08:00 America/Los_Angeles
#> 8  GMT-08:00 America/Los_Angeles
#> 9        GMT 1:00 Europe/Andorra
#> 10       GMT 1:00 Europe/Andorra

CodePudding user response:

The solution provided by @Allan is generally correct; however, it loops over the column values unnecessarily when a vectorized function would do. With Allan's data, the solution can be simplified to one line:

gsub('^GMT.+\\s(.*)/\\w+$', '\\1', df$Timezone)

# [1] "America" "Europe"  "America" "America" "Europe"  "America" "America" "America" "Europe"  "Europe"
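
One thing to watch for with this kind of replacement: `gsub()` and `sub()` return a string unchanged when the pattern does not match, so an unexpected value would end up in the new column as-is. A minimal sketch of guarding against that, assuming the pattern `^GMT.+\\s(.*)/\\w+$` and a made-up vector `tz` whose third element deliberately does not match:

```r
# Illustrative input; "UTC" is included only to show the non-matching case
tz <- c("GMT-08:00 America/Los_Angeles", "GMT 1:00 Europe/Andorra", "UTC")
pattern <- "^GMT.+\\s(.*)/\\w+$"

# Keep the captured group where the pattern matches, NA elsewhere
continent <- ifelse(grepl(pattern, tz), sub(pattern, "\\1", tz), NA_character_)
continent
```

Since the pattern is anchored with `^` and `$`, it can match at most once per string, so `sub()` behaves the same as `gsub()` here.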