I am importing a very large fixed-width dataset into R and wish to use vroom for much better speed. However, the dates in this dataset are in numeric format with either 7 or 8 digits, depending on whether the day of the month has 1 or 2 digits (examples below).
#8 digit date (1985-03-21):
# 21031985
#7 digit date (1985-03-01):
# 1031985
I cannot see any way to specify this type of format using col_date(format = )
as one normally would. It is easy to make a function that converts these 7/8 digit numbers into dates, but doing that means materialising the imported data and removes the speed advantage that vroom provides.
I am looking for a way to have vroom interpret these numbers on its own, or a workaround that does not sacrifice vroom's speed.
Thanks very much for any help here.
CodePudding user response:
Those formats are horrible in general, but regardless I expect nothing in readr
is going to work right for you because of the 1 or 2 digit day-of-month. I suggest importing reading that column in as col_character
, then post-processing them with
vec <- c("21031985", "1031985")
as.Date(paste0(strrep("0", pmax(8 - nchar(vec), 0)), vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"
Quick walk-through:
8 - nchar(vec)
tells us how many0
s need to be padded to the left of each string. In this case, it should be0
and1
, respectively. This might be a problem if you have length 6 strings, only you know if that's an issue.strrep("0", ..)
repeats the0
string as many times as we need, includingstrrep("0", 0)
producing""
(no zeroes).pmax(.., 0)
is the defensive programmer, if there's a length-9 string in there, we cannot dostrrep("0", -1)
, we want to keep it from going negative.paste0(.., vec)
to do the actual padding.
From there, all strings should be normalized and able to be converted using "%d%m%Y"
.
CodePudding user response:
Your data has been messed up and converted to integer at some point (have you made sure you import it correctly into R?). Anyway if you have integer data you can do.
as.Date(sprintf("d", vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"
CodePudding user response:
Vroom can use a pipe as input. That means you can use a tool like awk
to fix the format (e.g. make it always 8 digit, which is eaasy with sprintf
). That way you can still benefit from vroom streaming the file. You could even use R - but if you are after performance, you need something that can process the file streaming and better be lightweight.
I used a test file test.csv
:
id,date,text
1,1022020,some
2,12042020,more
3,2012020,text
I could read it via (of course the awk call needs to be adjusted for your data - but essentially if you need to just adjust the column $2
means 2nd column, the ','
specifies the separator):
vroom(pipe("awk -F ',' 'BEGIN{OFS=\",\"}; NR==1{print}; NR!=1 {$2=sprintf(\"d\",$2);print;}' test.csv"),
col_types=cols(date=col_date(format='%d%m%Y'))
)
giving
# A tibble: 3 × 3
id date text
<int> <date> <chr>
1 1 2020-02-01 some
2 2 2020-04-12 more
3 3 2020-01-02 text