The OCC option symbol consists of 4 parts:
AAPL220121C00145000
TSLA220121C01050000
HD220121C00400000
- Root symbol of the underlying stock or ETF, padded with spaces to 6 characters
- Expiration date, 6 digits in the format
"%Y-%m-%d"
- Option type, either P or C, for put or call
- Strike price, as the price x 1000, front padded with 0s to 8 digits
As an example AAPL220121C00145000
means a call
on AAPL
, expiring on 01/21/2022
, with a strike price of $145.00
.
I want to put these OCCs into data.table as follows
Symbol ExpDate Type Strike
AAPL 2022-01-21 call 145
TSLA 2022-01-21 call 1050
HD 2022-01-21 call 400
I tried the following but there seems to be some problem -
library(data.table)
read.table(text=gsub("(?=[A-Za-z])((\\d{2})(\\d{2})(\\d{2}))([PC])(\\d{8})", "\\1,\\2,\\3,\\4", "AAPL220121C00145000"),
colClasses = c("character", "date", "character", "numeric"), sep=",", stringsAsFactors = FALSE)
Can someone please help create a data.table in the format above using only regex?
CodePudding user response:
We can use strcapture
from base R
type.convert(strcapture("([A-Za-z] )(\\d{2}\\d{2}\\d{2})([PC])(\\d{8})",
"AAPL220121C00145000", proto = list(Symbol = "character",
ExpDate = "date", Type = "character", Strike = "numeric")), as.is = TRUE)
Symbol ExpDate Type Strike
1 AAPL 220121 C 145000
CodePudding user response:
library(dplyr)
df %>% mutate(stock = substring(option, 1, 4)) %>%
mutate(Strike = as.integer(substring(option, 12, 16))) %>%
mutate(Type = ifelse(substring(option, 11, 11) == 'C', 'CALL', 'PUT')) %>%
mutate(ExpDate = ymd(substring(option, 5, 10)))
option stock Strike Type ExpDate
1 AAPL220121C00145000 AAPL 145 CALL 2022-01-21
2 TSLA220121C01050000 TSLA 1050 CALL 2022-01-21
CodePudding user response:
Your questing spans multiple problems. I focused on an easy way to split the strings into fixed widt parts.
You will probabaly be able to solve the rest from there on.
library(iotools)
v <- c(4,6,1,8)
iotools::dstrfw(
mystrings,
col_types = c("character", "character", "character", "numeric"),
widths = v)
# V1 V2 V3 V4
# 1 AAPL 220121 C 145000
# 2 TSLA 220121 C 1050000
The date part cannot be handled in 1 go, since iotools only supports the %Y-%m-%d format of dates (using the fasttime package). The numeric part can easily be divided by 1000 though, since it already is read in as numeric (notice the leading zero's are stripped).