In Python (pandas), my DataFrame has one column:
User name
XYZ (R1)
ABCD (R2)
LKMNP (R2)
I want to find the position of the tags R1, R2 and R2 within each string and then remove them, together with their parentheses. The output should be:
User name   extract  Position  New name
XYZ (R1)    R1       5         XYZ
ABCD (R2)   R2       6         ABCD
LK (R2)     R2       4         LK
CodePudding user response:
You can use str.split or str.extract:
df['New name'] = df['User name'].str.split(r' (', regex=False).str[0]
# OR
df['New name'] = df['User name'].str.extract(r'([^ \(]+)', expand=False)
print(df)
# Output
    User name New name
0    XYZ (R1)      XYZ
1   ABCD (R2)     ABCD
2  LKMNP (R2)    LKMNP
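To try both variants side by side, here is a minimal self-contained sketch. The sample frame is rebuilt from the question, and the names by_split and by_extract are only illustrative. It assumes pandas 1.4 or later, where str.split accepts regex=False.

```python
import pandas as pd

# Sample frame rebuilt from the question (an assumption about the real data).
df = pd.DataFrame({"User name": ["XYZ (R1)", "ABCD (R2)", "LKMNP (R2)"]})

# Variant 1: split on the literal " (" and keep the part on the left.
by_split = df["User name"].str.split(" (", regex=False).str[0]

# Variant 2: capture the leading run of characters that are neither a space nor "(".
# expand=False returns a Series instead of a one-column DataFrame.
by_extract = df["User name"].str.extract(r"([^ (]+)", expand=False)

print(by_split.tolist())
print(by_extract.tolist())
```

On this sample both variants give the same names. They would differ for names that contain spaces, since the second pattern stops at the first space.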
Update
The regex condition is: the tag starts with an opening parenthesis (, followed by the letter R, then a number, then a closing parenthesis ).
df[['new name', 'extract', 'position']] = (
    df['User name'].str.extract(r'([^\s\(]+) \((R\d+)\)')
                   # position of R = length of the name + 2 (the space and the "(")
                   .assign(pos=lambda x: x[0].str.len() + 2)
)
Output:
>>> df
    User name new name extract  position
0    XYZ (R1)      XYZ      R1         5
1   ABCD (R2)     ABCD      R2         6
2  LKMNP (R2)    LKMNP      R2         7
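As a cross-check on the "length of the name plus 2" arithmetic, the same position can be read directly from the match object with plain re. This is only a sketch: locate_tag and PATTERN are hypothetical names, and it assumes the tag always has the form (R followed by digits).

```python
import re

# "(" + R + one or more digits + ")"; the capture group holds the tag itself.
PATTERN = re.compile(r"\((R\d+)\)")

def locate_tag(text):
    """Return (tag, 0-based index of the tag, text without the tag), or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Cut out the whole "(Rn)" part and trim the space left behind.
    cleaned = (text[:m.start()] + text[m.end():]).strip()
    return m.group(1), m.start(1), cleaned

for name in ["XYZ (R1)", "ABCD (R2)", "LKMNP (R2)"]:
    print(name, "->", locate_tag(name))
```

Here m.start(1) is the index of the R inside the string, so it should agree with the position column above for names without inner spaces.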
CodePudding user response:
IIUC, you can use:
# extract 3 parts
df[['new name', 'position', 'extract']] = df['User name'].str.extract(r'(\S+)(\s+\()([^\)]+)')
# use first 2 to compute the Rx position
df['position'] = (df[['new name', 'position']]
                  .stack().str.len()
                  .groupby(level=0).sum()
                  )
Output:
    User name new name extract  position
0    XYZ (R1)      XYZ      R1         5
1   ABCD (R2)     ABCD      R2         6
2  LKMNP (R2)    LKMNP      R2         7
Regex:
(\S+)     # group 1: non-space char(s)
(\s+\()   # group 2: space(s) and literal "("
([^\)]+)  # group 3: non-")" char(s)
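To see what each group captures, the breakdown above can be tried on a single string with plain re. This is just an illustrative sketch; the variable names are not part of the answer.

```python
import re

pattern = re.compile(r"(\S+)(\s+\()([^\)]+)")

m = pattern.search("ABCD (R2)")
name, gap, tag = m.groups()

# The tag starts right after group 1 and group 2, so its index is
# the sum of their lengths (this mirrors the stack/len/sum step above).
position = len(name) + len(gap)

print(repr(name), repr(gap), repr(tag), position)
```

Group 2 exists only so that its length can be counted; it is overwritten once the position has been computed.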
CodePudding user response:
You can use map with a function based on re to get what you want:
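import pandas as pd
import re

# group 1: text before "(...)", group 2: the parenthesised part, group 3: text after it
CUSTOM_REGEX = re.compile(r'^(.*)(\(.*\))(.*)')

def find_and_delete_sub(s):
    c = CUSTOM_REGEX.match(s)
    if c is not None:
        # index just after "(", and the name with the parenthesised part removed
        return c.span(2)[0] + 1, (c.group(1) + c.group(3)).strip()
    return None, None

pd.concat([df,
           pd.DataFrame(
               df['User Name'].map(find_and_delete_sub).tolist(),
               columns=["position", 'New Name'])],
          axis=1)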
Supposing that your input DataFrame df is:
df = pd.DataFrame(
    data=[
        ['XYZ (R1)'],
        ['ABCD (R2)'],
        ['LK (R2)'],
        ['LKMN ('],
    ],
    columns=['User Name'],
)
you get:
   User Name  position New Name
0   XYZ (R1)       5.0      XYZ
1  ABCD (R2)       6.0     ABCD
2    LK (R2)       4.0       LK
3     LKMN (       NaN     None