Set New Dataframe Column If Existing Columns Value Starts With A Number-CodePudding

How can I create a new column (like below) to separate out values if they start with a number? I've attempted utilizing variations of isdigit() and slicing the value to look at the first character [:1], but I haven't been able to get this to work. df.apply(lambda x: x if x['attr'][:1].isdigit()==False)

Dummy Data:

data = {'Name':['Bob','Kyle','Kevin'],
        'attr':['abc123','1230','(ab)']}

df = pd.DataFrame(data)

Desired Output:

data = {'Name':['Bob','Kyle','Kevin'],
        'attr':['abc123','1230','(ab)'],
        'num_start':[None,'1230',None],
         'str_start':['abc123',None,'(ab)']}

df = pd.DataFrame(data)

CodePudding user response：

Another possible solution:

cond = df['attr'].str.replace('[^\w\d]', '').str.contains(r'^\d')

df['num_start'] = np.where(cond, df['attr'], None)
df['str_start'] = np.where(cond, None, df['attr'])

Output:

    Name    attr num_start str_start
0    Bob  abc123      None    abc123
1   Kyle    1230      1230      None
2  Kevin    (ab)      None      (ab)

CodePudding user response：

Use a regex: `r'^\d'`

pandas.Series.str.startswith would work, except that it doesn't accept a regex; instead, use pandas.Series.str.contains and anchor to the start of the string with ^, then search for a numeric digit with \d.

df.attr.str.contains(r'^\d') gives a Series of True/False values; for rows where it is True, the value goes to the num_start column, and where it is False, the value goes to the str_start column.

From the first df,

condition = df.attr.str.contains(r'^\d')
df['num_start'] = df.attr.where(condition, other=None)
df['str_start'] = df.attr.where(~condition, other=None)

gives

    Name    attr num_start str_start
0    Bob  abc123      None    abc123
1   Kyle    1230      1230      None
2  Kevin    (ab)      None      (ab)

Postscript

If this is an x-y problem, where you want to handle the rows differently depending on whether the attr starts with a digit or no, consider something like

for starts_with_num, group in df.groupby(condition):
    # logic
    if starts_with_num:
        # do the thing (attr starts with a digit)
    else:
        # do the other thing (attr doesn't start with a digit)

Use a regex: r'^\d'

Postscript

Use a regex: `r'^\d'`