How can I find the position of a substring (list values) in a data frame column

Time:07-05

My data frame has one column:

User name
XYZ (R1)
ABCD (R2) 
LKMNP (R2)

I want to find the position of R1, R2 and R2 in the data frame and then remove them. The output should be:

User name   extract  Position  New name      
XYZ (R1)    R1       5         XYZ
ABCD (R2)   R2       6         ABCD
LKMNP (R2)  R2       7         LKMNP

CodePudding user response:

You can use str.split or str.extract:

df['New name'] = df['User name'].str.split(r' (', regex=False).str[0]
# OR
df['New name'] = df['User name'].str.extract(r'([^ \(]+)')
print(df)

# Output
    User name New name
0    XYZ (R1)      XYZ
1   ABCD (R2)     ABCD
2  LKMNP (R2)    LKMNP

Update

The regex condition is: the part to extract starts with an opening parenthesis "(", followed by the letter R and a number, and ends with a closing parenthesis ")".

df[['new name', 'extract', 'position']] = (
    df['User name'].str.extract(r'([^\s\(]+) \((R\d+)\)')
                   .assign(pos=lambda x: x[0].str.len() + 2)
)

Output:

>>> df
    User name new name extract  position
0    XYZ (R1)      XYZ      R1         5
1   ABCD (R2)     ABCD      R2         6
2  LKMNP (R2)    LKMNP      R2         7
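
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the three sample rows from the question, and it swaps the length arithmetic for `Series.str.find`, so the position is read directly from the location of the opening parenthesis. The named groups `name` and `tag` are my own labels, not part of the answer.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"User name": ["XYZ (R1)", "ABCD (R2)", "LKMNP (R2)"]})

# Capture the name and the R-tag in one pass (group names are illustrative).
parts = df["User name"].str.extract(r"^(?P<name>[^\s(]+)\s*\((?P<tag>R\d+)\)")
df["New name"] = parts["name"]
df["extract"] = parts["tag"]

# 0-based index of the tag: one character past the opening parenthesis.
df["Position"] = df["User name"].str.find("(") + 1

print(df)
```

Because the position comes from `str.find` rather than from the name length, it should still be right if a row has more than one space before the parenthesis.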

CodePudding user response:

IIUC, you can use:

# extract 3 parts
df[['new name', 'position', 'extract']] = df['User name'].str.extract(r'(\S+)(\s+\()([^\)]+)')

# use first 2 to compute the Rx position
df['position'] = (df[['new name', 'position']]
                  .stack().str.len()
                  .groupby(level=0).sum()
                  )

Output:

    User name  new name  extract  position
0    XYZ (R1)       XYZ       R1         5
1   ABCD (R2)      ABCD       R2         6
2  LKMNP (R2)     LKMNP       R2         7

Regex:

(\S+)      # group 1: non-space char(s)
(\s+\()    # group 2: space(s) and literal "("
([^\)]+)   # group 3: non-")" char(s)
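
To see what each group captures, here is a small sketch with the standard `re` module on a single string. It assumes the pattern is the three-group one described above, with `+` quantifiers; the variable names are only for illustration.

```python
import re

# Three groups: name, the space(s) plus "(", and the text up to ")".
PATTERN = re.compile(r"(\S+)(\s+\()([^\)]+)")

m = PATTERN.match("ABCD (R2)")
name, gap, tag = m.groups()

# The tag starts right after the first two groups, so their lengths add up
# to its 0-based position (the same value as m.start(3)).
position = len(name) + len(gap)

print(repr(name), repr(gap), repr(tag), position)
```

This is the same sum the `stack().str.len().groupby(level=0).sum()` step computes per row.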

CodePudding user response:

You can use map with a function based on re to get what you want

import pandas as pd
import re

CUSTOM_REGEX = re.compile(r'^(.*)(\(.*\))(.*)')

def find_and_delete_sub(s):
    c = CUSTOM_REGEX.match(s)
    if c is not None:
        return c.span(2)[0] + 1, c.group(1) + c.group(3)
    return None, None

pd.concat([df,
           pd.DataFrame(
             df['User Name'].map(find_and_delete_sub).tolist(),
             columns=["position", 'New Name'])], 
          axis=1)

Supposing that your input dataframe df is:

df = pd.DataFrame(
    data=[
        ['XYZ (R1)'],
        ['ABCD (R2)'],
        ['LK (R2)'],
        ['LKMN ('],
    ],
    columns=['User Name'],
)

you get:

   User Name  position New Name
0   XYZ (R1)       5.0     XYZ 
1  ABCD (R2)       6.0    ABCD 
2    LK (R2)       4.0      LK 
3     LKMN (       NaN     None
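
Note that the names returned this way keep the space that preceded the parenthesis. A minimal sketch of one way to trim it, assuming the same regex; the helper name `find_and_strip_sub` is hypothetical and not part of the answer above:

```python
import re

CUSTOM_REGEX = re.compile(r"^(.*)(\(.*\))(.*)")

def find_and_strip_sub(s):
    """Return (0-based index just after "(", text without the parenthesised part)."""
    m = CUSTOM_REGEX.match(s)
    if m is None:
        # No complete "(...)" pair in the string.
        return None, None
    # Join what is left of the string and drop leading/trailing whitespace.
    cleaned = (m.group(1) + m.group(3)).strip()
    return m.start(2) + 1, cleaned

results = [find_and_strip_sub(s) for s in ["XYZ (R1)", "ABCD (R2)", "LK (R2)", "LKMN ("]]
print(results)
```

It can be passed to `df['User Name'].map(...)` in the same way as the original function.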