Extracting particular information of character variables-CodePudding

for a side project, I am currently trying to make a dashboard based on this data (a kind of scatterplot) where there is a date and how the prices change over time. I am currently in one of the hardest states (data cleaning). I am trying to split the variables in order to extract information. For example, in the variable departure_info I want to split into the variable departure_info_time and departure_info_day. Furthermore, in the variable price I want to extract the numbers such as 358, 480, 590, etc.

df1 <- data.frame(depart = c("OSL", "WAW", "VIE", "MUC", "FRA"),
                  destination = c("KEF", "ARN", "RIX", "VCE", "OSL"),
                  departue_info = c("['12:45 am Sa 19 Feb']", "['07:55 am Sa 19 Feb']", "['09:05 am Sa 19 Feb']", "['21:45 am Sa 19 Feb', '15:30 am Sa 19 Feb']", "['10:25 am Sa 19 Feb', '16:10 am Sa 19 Feb', '21:40 am Sa 19 Feb']"),
                  price = c("['358<U 0080>']", "['480<U 0080>']", "['590<U 0080>']", "['354<U 0080>', '418<U 0080>']", "['249<U 0080>', '249<U 0080>', '249<U 0080>', '419<U 0080>']"))

I would appreciate if someone can help me. I tried with str_extract() and gsub() but I could not succeed. I also would thank if someone can give me an advice what I have to do, if in a row I have several prices in another row just one.

I thank you for your help :)

CodePudding user response：

It looks like each item in the departure_info has a fixed number of characters, so I use substr to extract the substring from the 3rd to 10th character for the time and 12th to 20th for the date.

We could take the same approach for price, but it's a simple regex pattern to look for one or more numbers, "\\d ", so I just use that with str_extract, which will extract the first occurrence of the pattern, that is the first set of consecutive numbers in the price string.

library(stringr)
library(dplyr)
df1 %>%
  mutate(
    depart_time = substr(departue_info, start = 3, stop = 10),
    depart_date = substr(departue_info, start = 12, stop = 20),
    price_num = as.numeric(str_extract(price, "\\d "))
  )
#   depart destination                                                      departue_info
# 1    OSL         KEF                                             ['12:45 am Sa 19 Feb']
# 2    WAW         ARN                                             ['07:55 am Sa 19 Feb']
# 3    VIE         RIX                                             ['09:05 am Sa 19 Feb']
# 4    MUC         VCE                       ['21:45 am Sa 19 Feb', '15:30 am Sa 19 Feb']
# 5    FRA         OSL ['10:25 am Sa 19 Feb', '16:10 am Sa 19 Feb', '21:40 am Sa 19 Feb']
#                                                          price depart_time depart_date price_num
# 1                                              ['358<U 0080>']    12:45 am   Sa 19 Feb       358
# 2                                              ['480<U 0080>']    07:55 am   Sa 19 Feb       480
# 3                                              ['590<U 0080>']    09:05 am   Sa 19 Feb       590
# 4                               ['354<U 0080>', '418<U 0080>']    21:45 am   Sa 19 Feb       354
# 5 ['249<U 0080>', '249<U 0080>', '249<U 0080>', '419<U 0080>']    10:25 am   Sa 19 Feb       249

If you want to get all the price numbers, we'll first get rid of the U0080s and then use str_extract_all instead of str_extract. This will generate a list column, which you can then use in various ways...

df1 %>%
  select(price) %>%
  mutate(
    price_num = price %>% str_replace_all(fixed("U 0080"), "") %>% str_extract_all("\\d ")
  )
#                                                         price          price_num
# 1                                              ['358<U 0080>']                358
# 2                                              ['480<U 0080>']                480
# 3                                              ['590<U 0080>']                590
# 4                               ['354<U 0080>', '418<U 0080>']           354, 418
# 5 ['249<U 0080>', '249<U 0080>', '249<U 0080>', '419<U 0080>'] 249, 249, 249, 419