Home > Back-end >  Pattern extraction derived from existing column in Python
Pattern extraction derived from existing column in Python

Time:06-11

I have a dataset where I would like to create a new column derived from one of my existing columns. The new column is created by taking the values after the first colon up until, and not including the last colon.

Data

site        stat    crate   
JJ - site 1 ok      JD1:0055:AB5.30:100 
JJ - site 1 ok      JD1:00AB:AB5.30:111 
J1 - site 2 fail    JD1:00AA:AB5.30:200 
JJ - site 1 ok      JD1:00A:AB5.30:555  
CC - site 8 fail    CC5:01AAA:BA8.40:777    

Desired

site        stat    main_cr        crate
JJ - site 1 ok      0055:AB5.30    JD1:0055:AB5.30:100
JJ - site 1 ok      00AB:AB5.30    JD1:00AB:AB5.30:111
J1 - site 2 fail    00AA:AB5.30    JD1:00AA:AB5.30:200  
JJ - site 1 ok      00A:AB5.30     JD1:00A:AB5.30:555
CC - site 8 fail    01AAA:BA8.40   CC5:01AAA:BA8.40:777 

Doing

df['main_cr']=df['crate'].str.extract(r':(\d):.*):')

I've read the documentation but am a little stuck on the placement of the extraction steps Any suggestion is helpful thank you.

CodePudding user response:

I think you can get the result you describe like this:

df['main_cr']=df['crate'].str.extract(r':(.*):')

Because .* is greedy, it will match all the : characters except the first and last ones.

Input:

  site  stat                 crate
0    a    ok   JD1:0055:AB5.30:100
1    b    ok   JD1:00AB:AB5.30:111
2    c  fail   JD1:00AA:AB5.30:200
3    d    ok    JD1:00A:AB5.30:555
4    e  fail  CC5:01AAA:BA8.40:777

Output:

  site  stat                 crate       main_cr
0    a    ok   JD1:0055:AB5.30:100   0055:AB5.30
1    b    ok   JD1:00AB:AB5.30:111   00AB:AB5.30
2    c  fail   JD1:00AA:AB5.30:200   00AA:AB5.30
3    d    ok    JD1:00A:AB5.30:555    00A:AB5.30
4    e  fail  CC5:01AAA:BA8.40:777  01AAA:BA8.40

Full test code:

import pandas as pd
df = pd.DataFrame({
'site':['a','b','c','d','e'],
'stat':['ok','ok','fail','ok','fail'],
'crate':['JD1:0055:AB5.30:100','JD1:00AB:AB5.30:111','JD1:00AA:AB5.30:200','JD1:00A:AB5.30:555','CC5:01AAA:BA8.40:777']
})
df['main_cr']=df['crate'].str.extract(r':(.*):')

CodePudding user response:

You can consider using str.replace with the regex ^[^:] : ie deleting evertything from the begining until the first :

df['crate'].str.replace('^[^:] :|:[^:] $', '', regex = True)
0     0055:AB5.30
1     00AB:AB5.30
2     00AA:AB5.30
3      00A:AB5.30
4    01AAA:BA8.40
Name: crate, dtype: object
  • Related