Inside my dataframe:
no pre_code
1 23, 234, 345
2 234, 345
3 23
4 NaN
I want to count the number of comma-separated values in each row of the pre_code
column. What I have tried so far:
df['count'] = df['pre_code'].astype('str').str.split(',').str.len().fillna(0)
But with the code above, NaN is counted as 1, so I don't get the desired result.
Before that, I also tried this:
df['count'] = df['pre_code'].str.count(',').add(1).fillna(0)
Unfortunately, this did not work on my dataframe either: it gives me 0 for rows that contain a single entry. For your information, my dataframe has 2200 rows, and the code does not work correctly on the full data. When I tried it on only 5 rows, it worked well.
I expect the result to look like:
no pre_code count
1 23, 234, 345 3
2 234, 345 2
3 23 1
4 NaN 0
Is there any solution for my case?
Thanks in advance.
CodePudding user response:
I think you need a real missing value (np.nan) instead of the string 'nan'. Once that is replaced, both of your solutions work correctly.
First check what the values without digits look like, so you know what to replace:
print (df.loc[~df['pre_code'].str.contains(r'\d'), 'pre_code'].unique().tolist())
['nan']
df['count'] = df['pre_code'].replace('nan', np.nan).str.split(',').str.len().fillna(0)
Or:
df['count'] = df['pre_code'].replace('nan', np.nan).str.count(',').add(1).fillna(0)
print (df)
no pre_code count
0 1 23, 234, 345 3.0
1 2 234, 345 2.0
2 3 23 1.0
3 4 NaN 0.0
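To make the difference visible, here is a minimal self-contained sketch. The sample frame is made up for illustration and assumes the missing entry is stored as the string 'nan', as the check above suggests; the names naive and cleaned are only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the last value is the string 'nan',
# not a real missing value.
df = pd.DataFrame({
    "no": [1, 2, 3, 4],
    "pre_code": ["23, 234, 345", "234, 345", "23", "nan"],
})

# Without the replacement, 'nan' is an ordinary one-item string,
# so it is counted as 1.
naive = df["pre_code"].str.split(",").str.len()

# Turn the string 'nan' into a real missing value, then count.
cleaned = df["pre_code"].replace("nan", np.nan)
df["count"] = cleaned.str.split(",").str.len().fillna(0).astype(int)

print(naive.tolist())
print(df)
```

The trailing astype(int) is optional; it only avoids the float output (3.0, 2.0, ...) that fillna leaves behind.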
EDIT: A more general solution is to convert values without digits to NaN using Series.where with Series.str.contains:
df['count'] = (df['pre_code'].where(df['pre_code'].str.contains(r'\d', na=False))
.str.count(',')
.add(1)
.fillna(0)
.astype(int))
print (df)
no pre_code count
0 1 23, 234, 345 3
1 2 234, 345 2
2 3 23 1
3 4 NaN 0
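As a rough check of why this is more general, the sketch below runs the same idea on a made-up Series that mixes several kinds of "empty" values (the string 'nan', a real NaN, and an empty string). The data and the names s, has_digit and counts are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical values: three real entries and three kinds of "missing".
s = pd.Series(["23, 234, 345", "234, 345", "23", "nan", np.nan, ""])

# True only where the value is a string containing at least one digit;
# na=False makes real missing values evaluate to False as well.
has_digit = s.str.contains(r"\d", na=False)

# Values without digits become NaN, so they drop out of the count
# and are filled with 0 at the end.
counts = s.where(has_digit).str.count(",").add(1).fillna(0).astype(int)

print(counts.tolist())
```

Because the mask is built from "contains a digit" rather than from one specific placeholder, it does not matter how the missing entries are spelled.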
CodePudding user response:
Try:
df['count'] = df.loc[df['pre_code'].notna(), 'pre_code'] \
.astype(str).str.split(',').str.len() \
.reindex(df.index, fill_value=0)
print(df)
# Output:
no pre_code count
0 1 23, 234, 345 3
1 2 234, 345 2
2 3 23 1
3 4 NaN 0
I'm not sure whether you have to convert to str (`astype(str)`).
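One case where the cast could matter is a column of mixed types. This is only a guess about the question's data: if single entries such as 23 were parsed as numbers rather than strings, the .str methods return NaN for them, which would explain the 0 for single-entry rows. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the single entry 23 is an integer, not a string.
df = pd.DataFrame({"pre_code": ["23, 234, 345", "234, 345", 23, np.nan]})

# .str methods return NaN for non-string elements, so 23 ends up as 0.
without_cast = df["pre_code"].str.count(",").add(1).fillna(0).astype(int)

# Dropping missing values first and then casting keeps 23 countable,
# and reindex puts the dropped rows back with a count of 0.
present = df.loc[df["pre_code"].notna(), "pre_code"].astype(str)
with_cast = present.str.split(",").str.len().reindex(df.index, fill_value=0)

print(without_cast.tolist())
print(with_cast.tolist())
```

Filtering with notna() before astype(str) is what keeps real missing values from turning into the string 'nan'.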