How can I use regex to parse an OCC Option Symbol in R?-CodePudding

The OCC option symbol consists of 4 parts:

AAPL220121C00145000
TSLA220121C01050000
HD220121C00400000

Root symbol of the underlying stock or ETF, padded with spaces to 6 characters
Expiration date, 6 digits in the format "%Y-%m-%d"
Option type, either P or C, for put or call
Strike price, as the price x 1000, front padded with 0s to 8 digits

As an example AAPL220121C00145000 means a call on AAPL, expiring on 01/21/2022, with a strike price of $145.00.

I want to put these OCCs into data.table as follows

Symbol   ExpDate    Type   Strike
AAPL     2022-01-21   call    145
TSLA     2022-01-21   call   1050
HD       2022-01-21   call    400

I tried the following but there seems to be some problem -

library(data.table)
read.table(text=gsub("(?=[A-Za-z])((\\d{2})(\\d{2})(\\d{2}))([PC])(\\d{8})", "\\1,\\2,\\3,\\4", "AAPL220121C00145000"),
colClasses = c("character", "date", "character", "numeric"), sep=",", stringsAsFactors = FALSE)

Can someone please help create a data.table in the format above using only regex?

CodePudding user response：

We can use strcapture from base R

type.convert(strcapture("([A-Za-z] )(\\d{2}\\d{2}\\d{2})([PC])(\\d{8})", 
   "AAPL220121C00145000", proto = list(Symbol = "character", 
     ExpDate = "date", Type = "character", Strike = "numeric")), as.is = TRUE)
  Symbol ExpDate Type Strike
1   AAPL  220121    C 145000

CodePudding user response：

    library(dplyr)

df %>% mutate(stock = substring(option, 1, 4)) %>% 
mutate(Strike = as.integer(substring(option, 12, 16))) %>% 
mutate(Type = ifelse(substring(option, 11, 11) == 'C', 'CALL', 'PUT')) %>%
mutate(ExpDate = ymd(substring(option, 5, 10))) 
               option stock Strike Type    ExpDate
1 AAPL220121C00145000  AAPL    145 CALL 2022-01-21
2 TSLA220121C01050000  TSLA   1050 CALL 2022-01-21

CodePudding user response：

Your questing spans multiple problems. I focused on an easy way to split the strings into fixed widt parts.

You will probabaly be able to solve the rest from there on.

library(iotools)
v <- c(4,6,1,8)
iotools::dstrfw(
  mystrings, 
  col_types = c("character", "character", "character", "numeric"), 
  widths = v)

#     V1     V2 V3      V4
# 1 AAPL 220121  C  145000
# 2 TSLA 220121  C 1050000

The date part cannot be handled in 1 go, since iotools only supports the %Y-%m-%d format of dates (using the fasttime package). The numeric part can easily be divided by 1000 though, since it already is read in as numeric (notice the leading zero's are stripped).