Pandas: Why does the length of an empty list equal 1?

In the example DataFrame below, why is the length of an empty list 1? I'd expect an empty list to have length 0, since len([]) == 0.

Use case:

I'm trying to count the number of values in each row, where each row holds a string of comma-separated values that are either integers or alphanumeric.

Example:

Create the sample dataset:

import pandas as pd

df = pd.DataFrame({'col1': ['1,2,3,4', '1,2,3', '1,2', '1A, 363C', 
                   '1,1-33', '26a, Green House', '** All **', '', '']})

df['col1']

0             1,2,3,4
1               1,2,3
2                 1,2
3            1A, 363C
4              1,1-33
5    26a, Green House
6           ** All **
7                    
8                    
Name: col1, dtype: object

Split the string on comma to create lists of values:

df['col1'].str.split(',')

0           [1, 2, 3, 4]
1              [1, 2, 3]
2                 [1, 2]
3            [1A,  363C]
4              [1, 1-33]
5    [26a,  Green House]
6            [** All **]
7                     []
8                     []
Name: col1, dtype: object

Try to determine the length of each list:

df['col1'].str.split(',').map(len)

0    4
1    3
2    2
3    2
4    2
5    2
6    1
7    1  <-- Expected to be 0
8    1  <-- Expected to be 0
Name: col1, dtype: int64

Questions:

Why is the length of an empty list 1?
- As pointed out by @Timus, using .map(repr) shows the list isn't empty: ['']. Thank you.
What would be a better approach for this use-case?

CodePudding user response：

We can try str.count, which counts regex matches per row; here each match is a run of non-comma characters:

df['count'] = df['col1'].str.count(r'[^,]+')

      col1  count
0  1,2,3,4      4
1    1,2,3      3
2      1,2      2
3       1A      1
4               0
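
As a quick check of the str.count idea, here is a minimal sketch on a few rows from the question, assuming the intended pattern is [^,]+ (one or more non-comma characters). The sample values and the variable name counts are chosen for illustration only.

```python
import pandas as pd

# A few rows taken from the question's sample data, plus an empty string.
df = pd.DataFrame({'col1': ['1,2,3,4', '1A, 363C', '** All **', '']})

# Each run of non-comma characters counts as one value.
counts = df['col1'].str.count(r'[^,]+')
print(counts.tolist())  # [4, 2, 1, 0]

# Caveat: an empty field between two commas is not counted by this pattern,
# whereas split(',') would yield an empty string for it.
print(pd.Series(['1,,2']).str.count(r'[^,]+').tolist())  # [2]
```

The empty string contains no non-comma characters, so it yields 0 without any special handling.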

CodePudding user response：

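The lists in the last rows are not empty. Each one contains a single empty string, because splitting an empty string returns a one-element list: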

>>> ''.split(',')
['']
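
The same thing can be seen inside pandas. This is a small sketch, using a shortened version of the question's data, of the .map(repr) trick mentioned in the question; the names s and lists are illustrative.

```python
import pandas as pd

s = pd.Series(['1,2', '** All **', ''])
lists = s.str.split(',')

# repr shows the real contents of each list, including the empty string
# that the default Series display hides.
print(lists.map(repr).tolist())  # ["['1', '2']", "['** All **']", "['']"]

# The last list has one element (the empty string), hence length 1.
print(lists.map(len).tolist())  # [2, 1, 1]
```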

CodePudding user response：

If you want to count the empty strings as 0 you can mask them:

df['col1'].str.split(',').str.len().mask(df['col1'].eq(''),0)

Note, however, that split followed by len is not the most straightforward approach. You can just count the separators (,) and then add 1 wherever the string is not empty:

df['col1'].str.count(',').add(df['col1'].ne(''))

Output:

0    4
1    3
2    2
3    1
4    0
Name: col1, dtype: int64
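
To see that the two expressions in this answer agree, here is a minimal sketch on a five-row series matching the output above. The names s, by_split and by_count are introduced here for illustration.

```python
import pandas as pd

s = pd.Series(['1,2,3,4', '1,2,3', '1,2', '1A', ''])

# Approach 1: split, take the list length, then force empty strings to 0.
by_split = s.str.split(',').str.len().mask(s.eq(''), 0)

# Approach 2: count commas, then add 1 for every non-empty string
# (the boolean mask is treated as 0/1 in the addition).
by_count = s.str.count(',').add(s.ne(''))

print(by_split.tolist())  # [4, 3, 2, 1, 0]
print(by_count.tolist())  # [4, 3, 2, 1, 0]
```

The second form avoids building intermediate lists, which is the reason the answer calls it more straightforward.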

CodePudding user response：

Thank you @Timus for the insight to use .map(repr) to reveal the non-empty list as [''].

Solution:

Replace all empty string values with NaN:

df['col1'] = df['col1'].replace('', float('nan'))

Apply a lambda expression that splits and counts when the value is not a float (NaN is a float), and returns 0 otherwise:

df['count'] = df['col1'].apply(lambda x: len(x.split(',')) if not isinstance(x, float) else 0)

Result:

      col1  count
0  1,2,3,4      4
1    1,2,3      3
2      1,2      2
3       1A      1
4      NaN      0