Home > database >  strptime - set format with data in csv file
strptime - set format with data in csv file

Time:10-04

Example Raw data:

1
Mon Apr 06 23:55:14 PDT 2009

2
Tue Apr 07 01:16:43 PDT 2009

3
Tue Apr 07 03:06:17 PDT 2009

I having a problem with doing the format with strptime since i having a data from a csv file.

dates <- read.csv(file = "Australia_timestamp.csv")
colnames(dates) <- c("Date")
format <- "%a %b %d %H:%M:%S %z %Y"
dates <-strptime(dates[], format = format)

Output:

NA

How can I solve this issue by not having NA and having a output with correct format?

CodePudding user response:

The NA you're getting is because %z does not capture PDT. It would fall under %Z as it is not numeric, but strptime does not support %Z for input as character abbreviations are not unambiguous. PDT is UTC-7, so to capture the timestamp using strptime, your strings would have to be, for example,

Tue Apr 07 03:06:17 -0700 2009

What you can do to solve the issue depends on the data you have. If all your timestamps are in PDT, you can simply replace PDT in all your strings by -0700:

dates$Date <- gsub("PDT", "-0700", dates$Date)
dates$Date <- strptime(dates$Date, format = format)

CodePudding user response:

1) The subject refers to a csv file but the data shown is not in csv form. We will assume that the form shown in the question is correct and the input file contains a first line of just 1, a second line describing a date/time, a third which is blank and so on as in the Note at the end.

We read it in line by line, and then prefix successive lines with num: , datetime: and an empty string. That will convert it to debian control format form (dcf) so we can use read.dcf to read it giving a character matrix. Convert that to a data frame, use convert.type to convert the first column to numeric and convert the second column to POSIXct.

The first two lines in the pipeline below are written to make use of Lines in the Note for reproducibility but to read the same from a file just replace the first two lines in the pipeline with something like this, "myfile" |> where myfile is the name of the input file.

No packages are used.

prefix <- \(x) paste(rep(c("num:", "time:", ""), length = length(x)), x)

Lines |>
  textConnection(name = "") |>
  readLines() |>
  trimws() |>
  prefix() |> 
  textConnection(name = "") |>
  read.dcf() |>
  as.data.frame() |>
  type.convert(as.is = TRUE) |>
  transform(time = as.POSIXct(time, format = "%a %b %d %H:%M:%S PDT %Y"))

giving this data.frame:

  num                time
1   1 2009-04-06 23:55:14
2   2 2009-04-07 01:16:43
3   3 2009-04-07 03:06:17

2) Another way to do this is to convert the data to a matrix as shown and then proceed as above.

Lines |>
  textConnection(name = "") |>
  read.table(sep = "?", strip.white = TRUE) |>
  unlist() |>
  matrix(ncol = 2, byrow = TRUE, dimnames = list(NULL, c("num", "time"))) |>
  as.data.frame() |>
  type.convert(as.is = TRUE) |>
  transform(time = as.POSIXct(time, format = "%a %b %d %H:%M:%S PDT %Y"))

3) If the format shown in the question is an error and it was meant that the 1, 2, 3, ... appears before the datetime on the same line like this then read it in line by line, replace the first space with a comma and read that into a data frame with the indicated column names. Finally convert the time column to POSIXct.

Lines2 <- "1 Mon Apr 06 23:55:14 PDT 2009
2 Tue Apr 07 01:16:43 PDT 2009
3 Tue Apr 07 03:06:17 PDT 2009"

Lines2 |>
  textConnection(name = "") |>
  readLines() |>
  trimws() |>
  sub(pattern = " ", replacement = ",") |>
  textConnection(name = "") |>
  read.table(sep = ",", col.names = c("num", "time"), strip.white = TRUE) |>
  transform(time = as.POSIXct(time, format = "%a %b %d %H:%M:%S PDT %Y"))

4) If the input is truly a csv file but with no column headers as shown in Lines3 then it is even easier:

Lines3 <- "1,Mon Apr 06 23:55:14 PDT 2009
2,Tue Apr 07 01:16:43 PDT 2009
3,Tue Apr 07 03:06:17 PDT 2009"

Lines3 |>
  textConnection(name = "") |>
  read.table(sep = ",", col.names = c("num", "time"), strip.white = TRUE) |>
  transform(time = as.POSIXct(time, format = "%a %b %d %H:%M:%S PDT %Y"))

Note

Lines <- "1
Mon Apr 06 23:55:14 PDT 2009

2
Tue Apr 07 01:16:43 PDT 2009

3
Tue Apr 07 03:06:17 PDT 2009"
  • Related