How can I create a new column (like below) to separate out values if they start with a number? I've attempted utilizing variations of isdigit() and slicing the value to look at the first character [:1], but I haven't been able to get this to work. df.apply(lambda x: x if x['attr'][:1].isdigit()==False)
Dummy Data:
data = {'Name':['Bob','Kyle','Kevin'],
'attr':['abc123','1230','(ab)']}
df = pd.DataFrame(data)
Desired Output:
data = {'Name':['Bob','Kyle','Kevin'],
'attr':['abc123','1230','(ab)'],
'num_start':[None,'1230',None],
'str_start':['abc123',None,'(ab)']}
df = pd.DataFrame(data)
CodePudding user response:
Another possible solution:
cond = df['attr'].str.replace('[^\w\d]', '').str.contains(r'^\d')
df['num_start'] = np.where(cond, df['attr'], None)
df['str_start'] = np.where(cond, None, df['attr'])
Output:
Name attr num_start str_start
0 Bob abc123 None abc123
1 Kyle 1230 1230 None
2 Kevin (ab) None (ab)
CodePudding user response:
Use a regex: r'^\d'
pandas.Series.str.startswith
would work, except that it doesn't accept a regex; instead, use pandas.Series.str.contains
and anchor to the start of the string with ^
, then search for a numeric digit with \d
.
df.attr.str.contains(r'^\d')
gives a Series
of True/False
values; for rows where it is True
, the value goes to the num_start
column, and where it is False
, the value goes to the str_start
column.
From the first df
,
condition = df.attr.str.contains(r'^\d')
df['num_start'] = df.attr.where(condition, other=None)
df['str_start'] = df.attr.where(~condition, other=None)
gives
Name attr num_start str_start
0 Bob abc123 None abc123
1 Kyle 1230 1230 None
2 Kevin (ab) None (ab)
Postscript
If this is an x-y problem, where you want to handle the rows differently depending on whether the attr
starts with a digit or no, consider something like
for starts_with_num, group in df.groupby(condition):
# logic
if starts_with_num:
# do the thing (attr starts with a digit)
else:
# do the other thing (attr doesn't start with a digit)