for a side project, I am currently trying to make a dashboard based on this data (a kind of scatterplot) where there is a date and how the prices change over time. I am currently in one of the hardest states (data cleaning).
I am trying to split the variables in order to extract information. For example, in the variable departure_info
I want to split into the variable departure_info_time
and departure_info_day
. Furthermore, in the variable price
I want to extract the numbers such as 358, 480, 590, etc.
df1 <- data.frame(depart = c("OSL", "WAW", "VIE", "MUC", "FRA"),
destination = c("KEF", "ARN", "RIX", "VCE", "OSL"),
departue_info = c("['12:45 am Sa 19 Feb']", "['07:55 am Sa 19 Feb']", "['09:05 am Sa 19 Feb']", "['21:45 am Sa 19 Feb', '15:30 am Sa 19 Feb']", "['10:25 am Sa 19 Feb', '16:10 am Sa 19 Feb', '21:40 am Sa 19 Feb']"),
price = c("['358<U 0080>']", "['480<U 0080>']", "['590<U 0080>']", "['354<U 0080>', '418<U 0080>']", "['249<U 0080>', '249<U 0080>', '249<U 0080>', '419<U 0080>']"))
I would appreciate if someone can help me. I tried with str_extract()
and gsub()
but I could not succeed. I also would thank if someone can give me an advice what I have to do, if in a row I have several prices in another row just one.
I thank you for your help :)
CodePudding user response:
It looks like each item in the departure_info
has a fixed number of characters, so I use substr
to extract the substring from the 3rd to 10th character for the time and 12th to 20th for the date.
We could take the same approach for price, but it's a simple regex pattern to look for one or more numbers, "\\d "
, so I just use that with str_extract
, which will extract the first occurrence of the pattern, that is the first set of consecutive numbers in the price
string.
library(stringr)
library(dplyr)
df1 %>%
mutate(
depart_time = substr(departue_info, start = 3, stop = 10),
depart_date = substr(departue_info, start = 12, stop = 20),
price_num = as.numeric(str_extract(price, "\\d "))
)
# depart destination departue_info
# 1 OSL KEF ['12:45 am Sa 19 Feb']
# 2 WAW ARN ['07:55 am Sa 19 Feb']
# 3 VIE RIX ['09:05 am Sa 19 Feb']
# 4 MUC VCE ['21:45 am Sa 19 Feb', '15:30 am Sa 19 Feb']
# 5 FRA OSL ['10:25 am Sa 19 Feb', '16:10 am Sa 19 Feb', '21:40 am Sa 19 Feb']
# price depart_time depart_date price_num
# 1 ['358<U 0080>'] 12:45 am Sa 19 Feb 358
# 2 ['480<U 0080>'] 07:55 am Sa 19 Feb 480
# 3 ['590<U 0080>'] 09:05 am Sa 19 Feb 590
# 4 ['354<U 0080>', '418<U 0080>'] 21:45 am Sa 19 Feb 354
# 5 ['249<U 0080>', '249<U 0080>', '249<U 0080>', '419<U 0080>'] 10:25 am Sa 19 Feb 249
If you want to get all the price numbers, we'll first get rid of the U0080
s and then use str_extract_all
instead of str_extract
. This will generate a list
column, which you can then use in various ways...
df1 %>%
select(price) %>%
mutate(
price_num = price %>% str_replace_all(fixed("U 0080"), "") %>% str_extract_all("\\d ")
)
# price price_num
# 1 ['358<U 0080>'] 358
# 2 ['480<U 0080>'] 480
# 3 ['590<U 0080>'] 590
# 4 ['354<U 0080>', '418<U 0080>'] 354, 418
# 5 ['249<U 0080>', '249<U 0080>', '249<U 0080>', '419<U 0080>'] 249, 249, 249, 419