Count the number of strings per row in a column with pandas

Inside my dataframe:

no    pre_code
1     23, 234, 345
2     234, 345
3     23
4     NaN

I want to count the number of strings in each row of the pre_code column. What I have tried so far is:

df['count'] = df['pre_code'].astype('str').str.split(',').str.len().fillna(0)

but with the code above, NaN is counted as 1, so I don't get the desired result.

Before that, I also tried:

df['count'] = df['pre_code'].str.count(',').add(1).fillna(0)

Unfortunately, that code did not work on my dataframe either: it gives me 0 for rows with a single entry. For your information, my dataframe has 2200 rows, and the code does not work correctly on the full data, although it worked well when I tried it on only 5 rows.

I expect the result to look like:

no    pre_code         count
1     23, 234, 345       3
2     234, 345           2
3     23                 1
4     NaN                0

Is there any solution for my case?

Thanks in advance.

CodePudding user response：

I think you need real missing values (np.nan) instead of the string 'nan'; then both of your solutions work correctly.

First check what the values without digits look like, so you know what to replace:

print (df.loc[~df['pre_code'].str.contains(r'\d'), 'pre_code'].unique().tolist())
['nan']
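
To see why the text 'nan' is counted as one entry while a real missing value is not, here is a minimal sketch. It assumes the column looks like the sample in the question after the `astype('str')` step; the Series is written out by hand and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Column as it looks after astype('str') in the question: the missing value
# has become the text 'nan' (assumed here, written out by hand).
as_text = pd.Series(["23, 234, 345", "234, 345", "23", "nan"])

# 'nan'.split(',') is the one-element list ['nan'], so its length is 1
# and fillna(0) would have nothing to fill.
counted_text = as_text.str.split(",").str.len()

# With a real missing value the .str methods return NaN,
# which fillna(0) can then replace.
as_missing = as_text.replace("nan", np.nan)
counted_missing = as_missing.str.split(",").str.len().fillna(0).astype(int)
```

Here `counted_text` ends with 1 for the last row, while `counted_missing` ends with 0.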

import numpy as np

df['count'] = df['pre_code'].replace('nan', np.nan).str.split(',').str.len().fillna(0)

Or:

df['count'] = df['pre_code'].replace('nan', np.nan).str.count(',').add(1).fillna(0)

print (df)
   no      pre_code  count
0   1  23, 234, 345    3.0
1   2      234, 345    2.0
2   3            23    1.0
3   4           NaN    0.0

EDIT: A more general solution is to convert values that contain no digits to NaN with Series.where and Series.str.contains:

df['count'] = (df['pre_code'].where(df['pre_code'].str.contains(r'\d', na=False))
                             .str.count(',')
                             .add(1)
                             .fillna(0)
                             .astype(int))
print (df)
   no      pre_code  count
0   1  23, 234, 345      3
1   2      234, 345      2
2   3            23      1
3   4           NaN      0
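
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The frame is rebuilt by hand from the question, with one extra row (an assumption, not from the original data) holding the text 'nan' to show that both kinds of "empty" value end up as 0.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand; row 5 with the text 'nan' is an added assumption.
df = pd.DataFrame({
    "no": [1, 2, 3, 4, 5],
    "pre_code": ["23, 234, 345", "234, 345", "23", np.nan, "nan"],
})

# True only where the value contains at least one digit; missing values count as False.
has_digit = df["pre_code"].str.contains(r"\d", na=False)

# Values without digits become NaN, so they drop out of the comma count
# and are set to 0 by fillna.
df["count"] = (df["pre_code"].where(has_digit)
                             .str.count(",")
                             .add(1)
                             .fillna(0)
                             .astype(int))
```

The design choice is to decide what counts as "empty" by content (no digit present) rather than by listing every placeholder string to replace.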

CodePudding user response：

Try:

df['count'] = df.loc[df['pre_code'].notna(), 'pre_code'] \
                .astype(str).str.split(',').str.len() \
                .reindex(df.index, fill_value=0)

print(df)

# Output:
   no      pre_code  count
0   1  23, 234, 345      3
1   2      234, 345      2
2   3            23      1
3   4           NaN      0

I'm not sure you have to convert to str (`astype(str)`).
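
Whether the cast matters depends on what the column really holds, which the thread does not show. As a sketch under one assumption: if single codes were read as numbers rather than text (a hypothetical mixed column, built by hand here), the `.str` methods return NaN for those rows. That would also produce the 0 reported in the question for single entries, though nothing in the thread confirms this is the cause.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: the single code 23 is an int, not the string '23'.
s = pd.Series(["23, 234, 345", "234, 345", 23, np.nan], dtype=object)

# On an object column, .str methods return NaN for non-string elements,
# so the integer row is turned into 0 by fillna(0).
no_cast = s.str.count(",").add(1).fillna(0).astype(int)

# Dropping the missing rows first and casting the rest to str
# lets the integer row be counted as one entry.
mask = s.notna()
with_cast = (s[mask].astype(str).str.split(",").str.len()
                    .reindex(s.index, fill_value=0))
```

In this sketch `no_cast` gives 0 for the integer row, while `with_cast` gives 1 for it and 0 only for the missing row.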