I have a dataset where I would like to create a new column derived from one of my existing columns. The new column should contain the text between the first and last colons of each value.
Data
site stat crate
AA - site 1 ok AD1:00:AB5.30:100
AA - site 1 ok AD1:00:AB5.30:111
A1 - site 2 fail AD1:00:AB5.30:200
AA - site 1 ok AD1:00:AB5.30:555
BB - site 8 fail BB5:01:BA8.40:777
Desired
site stat main_cr crate
AA - site 1 ok 00:AB5.30 AD1:00:AB5.30:100
AA - site 1 ok 00:AB5.30 AD1:00:AB5.30:111
A1 - site 2 fail 00:AB5.30 AD1:00:AB5.30:200
AA - site 1 ok 00:AB5.30 AD1:00:AB5.30:555
BB - site 8 fail 01:BA8.40 BB5:01:BA8.40:777
Doing
My approach is to use some form of regex or split.
df['main_cr'] = df['crate'].str.split(':').str[1:3]
This does not work: instead of a single string, it gives a list of the split pieces for each row, for example
[00, AB5.30]
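The split approach can still be made to work by joining the pieces back together. The following is a minimal sketch, assuming the data shown above is loaded into a DataFrame named df (rebuilt by hand here so the snippet runs on its own). It slices with [1:-1] rather than [1:3] so that everything between the first and last colons is kept regardless of how many fields there are.

```python
import pandas as pd

# Sample data rebuilt by hand from the question, for illustration only.
df = pd.DataFrame({
    "site": ["AA - site 1", "AA - site 1", "A1 - site 2", "AA - site 1", "BB - site 8"],
    "stat": ["ok", "ok", "fail", "ok", "fail"],
    "crate": [
        "AD1:00:AB5.30:100",
        "AD1:00:AB5.30:111",
        "AD1:00:AB5.30:200",
        "AD1:00:AB5.30:555",
        "BB5:01:BA8.40:777",
    ],
})

# Split each value on ":" to get a list of fields per row.
parts = df["crate"].str.split(":")

# Drop the first and last fields, then join the rest back into one string.
df["main_cr"] = parts.str[1:-1].str.join(":")

print(df[["site", "stat", "main_cr", "crate"]])
```

The key difference from the original attempt is the final .str.join(":"), which turns each list back into a string.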
What I want instead is a single string per row (such as 00:AB5.30) made of everything between the first and last colons of the existing column.
Any suggestions would be appreciated, thank you.
Answer:
Use Series.str.extract (the .str accessor on the column) with a regex pattern that contains one capture group:
df['main_cr']=df['crate'].str.extract(r':(\d{2}:.*):')
df
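Regex: the pattern finds the first ":" that is followed by exactly two digits (\d{2}) and another ":". The .* that follows is greedy, so it consumes as many characters as it can while still leaving a final ":" to match, which means the match runs up to the last colon in the string. The part inside the parentheses is the capture group, and that is what gets extracted. Note that this pattern assumes the second field is always two digits.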
site stat crate main_cr
AA - site 1 ok AD1:00:AB5.30:100 00:AB5.30
AA - site 1 ok AD1:00:AB5.30:111 00:AB5.30
A1 - site 2 fail AD1:00:AB5.30:200 00:AB5.30
AA - site 1 ok AD1:00:AB5.30:555 00:AB5.30
BB - site 8 fail BB5:01:BA8.40:777 01:BA8.40
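If the second field is not guaranteed to be two digits, a more general pattern can anchor on the colons themselves. This is a sketch under that assumption, not part of the original answer: the pattern ^[^:]*:(.*):[^:]*$ skips everything up to the first colon, captures greedily, and requires that no colon follows the closing one. The last value in the sample series is a made-up example added to show a non-numeric second field.

```python
import pandas as pd

# Values from the question, plus one hypothetical value ("X:ABC:9:Z")
# whose second field is not two digits.
crates = pd.Series([
    "AD1:00:AB5.30:100",
    "BB5:01:BA8.40:777",
    "X:ABC:9:Z",
])

# [^:]* matches the text before the first colon and after the last one;
# the capture group takes everything in between.
pattern = r"^[^:]*:(.*):[^:]*$"

# expand=False returns a Series instead of a one-column DataFrame.
main_cr = crates.str.extract(pattern, expand=False)

print(main_cr.tolist())
```

Rows with fewer than two colons do not match the pattern and come back as NaN, which may or may not be the behavior you want.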