Currently, part of my R assignment is to take a set of time zones and add a new column that shows each one's location. For example, if Timezone = "GMT 1:00 Europe/Andorra", then Timezone_Continent = "Europe". How can I select the values that contain keywords like "Europe" or "America" in a way that I can use them with gsub?
df["Timezone_Continent"] <- df$Timezone
ame <- df$Timezone[grep("America", df$Timezone)]
df["Timezone_Continent"] <- gsub(ame, "America", df["Timezone_Continent"])
This is what I currently have, and I know that the second line wouldn't work for gsub. I was just experimenting.
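For reference, a minimal sketch of how the attempt above might be made to work: gsub expects a single pattern, so one option is to let grepl pick out the matching rows and assign the keyword to them directly. The two-row data frame and the loop variable `continent` are assumptions made for illustration only.

```r
# Hypothetical sample data in the same shape as the question's column.
df <- data.frame(Timezone = c("GMT-08:00 America/Los_Angeles",
                              "GMT 1:00 Europe/Andorra"))

# Start with NA so rows that match no keyword stay visibly unassigned.
df$Timezone_Continent <- NA_character_

# grepl() returns one TRUE/FALSE per row, which can index the new column.
for (continent in c("America", "Europe")) {
  is_match <- grepl(continent, df$Timezone, fixed = TRUE)
  df$Timezone_Continent[is_match] <- continent
}

df
```

The loop runs once per keyword rather than once per row, so each step stays vectorized over the column.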
CodePudding user response:
Another option:
library(tidyverse)
df |>
mutate(TZ = str_extract(Timezone, "(?<=\\d\\s).*?(?=\\/)"))
#> # A tibble: 10 x 2
#> Timezone TZ
#> <chr> <chr>
#> 1 GMT-08:00 America/Los_Angeles America
#> 2 GMT 1:00 Europe/Andorra Europe
#> 3 GMT-08:00 America/Los_Angeles America
#> 4 GMT-08:00 America/Los_Angeles America
#> 5 GMT 1:00 Europe/Andorra Europe
#> 6 GMT-08:00 America/Los_Angeles America
#> 7 GMT-08:00 America/Los_Angeles America
#> 8 GMT-08:00 America/Los_Angeles America
#> 9 GMT 1:00 Europe/Andorra Europe
#> 10 GMT 1:00 Europe/Andorra Europe
CodePudding user response:
First define your lookup values
lookup <- c("America", "Europe")
Then create a new column from the lookup:
df$TZ <- apply(sapply(lookup, \(x) grepl(x, df$Timezone)), 1, \(x) lookup[x])
df
#> Timezone TZ
#> 1 GMT-08:00 America/Los_Angeles America
#> 2 GMT 1:00 Europe/Andorra Europe
#> 3 GMT-08:00 America/Los_Angeles America
#> 4 GMT-08:00 America/Los_Angeles America
#> 5 GMT 1:00 Europe/Andorra Europe
#> 6 GMT-08:00 America/Los_Angeles America
#> 7 GMT-08:00 America/Los_Angeles America
#> 8 GMT-08:00 America/Los_Angeles America
#> 9 GMT 1:00 Europe/Andorra Europe
#> 10 GMT 1:00 Europe/Andorra Europe
Created on 2022-10-23 with reprex v2.0.2
Data used
set.seed(1)
df <- data.frame(Timezone = sample(c("GMT-08:00 America/Los_Angeles",
"GMT 1:00 Europe/Andorra"), 10, TRUE))
df
#> Timezone
#> 1 GMT-08:00 America/Los_Angeles
#> 2 GMT 1:00 Europe/Andorra
#> 3 GMT-08:00 America/Los_Angeles
#> 4 GMT-08:00 America/Los_Angeles
#> 5 GMT 1:00 Europe/Andorra
#> 6 GMT-08:00 America/Los_Angeles
#> 7 GMT-08:00 America/Los_Angeles
#> 8 GMT-08:00 America/Los_Angeles
#> 9 GMT 1:00 Europe/Andorra
#> 10 GMT 1:00 Europe/Andorra
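The lookup approach above relies on every row matching exactly one lookup value; if a row matched none, apply would return a list rather than a character vector. A minimal sketch of a variant that leaves such rows as NA, assuming the lookup values can be joined into one alternation pattern (the "Asia/Tokyo" row is a hypothetical addition to show the unmatched case):

```r
lookup <- c("America", "Europe")

# Hypothetical data: the third row matches none of the lookup values.
df <- data.frame(Timezone = c("GMT-08:00 America/Los_Angeles",
                              "GMT 1:00 Europe/Andorra",
                              "GMT 9:00 Asia/Tokyo"))

# Join the lookup values into a single pattern: "America|Europe".
pattern <- paste(lookup, collapse = "|")

# regmatches() with regexpr() returns only the matched substrings,
# so they are assigned to the rows that actually contain a match.
df$TZ <- NA_character_
has_match <- grepl(pattern, df$Timezone)
df$TZ[has_match] <- regmatches(df$Timezone, regexpr(pattern, df$Timezone))

df
```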
CodePudding user response:
The solution provided by @Allan is generally correct; however, it loops over the column values unnecessarily when a vectorized function is available. With Allan's data, the solution can be simplified to one line:
gsub('^GMT.+\\s(.*)/\\w+$', '\\1', df$Timezone)
# [1] "America" "Europe" "America" "America" "Europe" "America" "America" "America" "Europe" "Europe"