In the example DataFrame, why is the length of an empty list 1? I'd expect an empty list to have length 0, since len([]) == 0.
Use case:
I'm trying to count the number of values in each row, where each row holds a string of comma-separated values that are either integers or alphanumeric.
Example:
Create the sample dataset:
import pandas as pd
df = pd.DataFrame({'col1': ['1,2,3,4', '1,2,3', '1,2', '1A, 363C',
'1,1-33', '26a, Green House', '** All **', '', '']})
df['col1']
0 1,2,3,4
1 1,2,3
2 1,2
3 1A, 363C
4 1,1-33
5 26a, Green House
6 ** All **
7
8
Name: col1, dtype: object
Split the string on comma to create lists of values:
df['col1'].str.split(',')
0 [1, 2, 3, 4]
1 [1, 2, 3]
2 [1, 2]
3 [1A, 363C]
4 [1, 1-33]
5 [26a, Green House]
6 [** All **]
7 []
8 []
Name: col1, dtype: object
Try to determine the length of each list:
df['col1'].str.split(',').map(len)
0 4
1 3
2 2
3 2
4 2
5 2
6 1
7 1 <-- Expected to be 0
8 1 <-- Expected to be 0
Name: col1, dtype: int64
Questions:
- Why is the length of an empty list 1?
  - As pointed out by @Timus, using .map(repr) shows the list isn't empty: ['']. Thank you.
- What would be a better approach for this use-case?
CodePudding user response:
We can try str.count with a pattern that matches each run of non-comma characters:
df['count'] = df['col1'].str.count(r'[^,]+')
col1 count
0 1,2,3,4 4
1 1,2,3 3
2 1,2 2
3 1A 1
4 0
CodePudding user response:
The last rows contain an empty string, and splitting an empty string on a separator returns a list holding one empty string:
>>> ''.split(',')
['']
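To make the difference concrete, here is a minimal sketch in plain Python (no pandas needed) comparing a split on an explicit separator with a split on whitespace. The variable names are only illustrative.

```python
# With an explicit separator, str.split always returns at least one element,
# so an empty string yields a list containing one empty string.
with_separator = ''.split(',')

# With no separator argument, split works on whitespace and drops empty
# strings, so the result really is an empty list.
without_separator = ''.split()

print(repr(with_separator), len(with_separator))        # [''] 1
print(repr(without_separator), len(without_separator))  # [] 0
```

This is why .map(len) reports 1 for the empty rows: the list is [''], which pandas displays as [] because it prints the elements without quotes.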
CodePudding user response:
If you want to count the empty strings as 0 you can mask them:
df['col1'].str.split(',').str.len().mask(df['col1'].eq(''),0)
Note, however, that split followed by len is not the most straightforward approach. You can just count the separators (,), then add 1 wherever the string is not empty:
df['col1'].str.count(',').add(df['col1'].ne(''))
Output:
0 4
1 3
2 2
3 1
4 0
Name: col1, dtype: int64
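As a rough check, the separator-counting idea can be run against the full sample from the question rather than the shortened frame shown in the output above. The intermediate names (separators, non_empty, counts) are introduced here only for readability.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'col1': ['1,2,3,4', '1,2,3', '1,2', '1A, 363C',
                            '1,1-33', '26a, Green House', '** All **', '', '']})

# Number of commas in each string.
separators = df['col1'].str.count(',')

# True where the string is not empty; True adds 1, False adds 0.
non_empty = df['col1'].ne('')

# A non-empty string with n commas holds n + 1 values; an empty string holds 0.
counts = separators.add(non_empty).astype(int)

print(counts.tolist())  # [4, 3, 2, 2, 2, 2, 1, 0, 0]
```

One caveat: this counts a trailing comma as an extra value, so '1,2,' would give 3. Whether that is desirable depends on the data.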
CodePudding user response:
Thank you @Timus for the suggestion to use .map(repr), which reveals that the list is not empty: [''].
Solution:
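Replace all empty string values with NaN (assigning the result back rather than using inplace=True on the column, which newer pandas versions warn about):
df['col1'] = df['col1'].replace('', float('nan'))
Apply a lambda to split and count, returning 0 when the value is a float (that is, NaN):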
df['count'] = df['col1'].apply(lambda x: len(x.split(',')) if not isinstance(x, float) else 0)
Result:
col1 count
0 1,2,3,4 4
1 1,2,3 3
2 1,2 2
3 1A 1
4 NaN 0
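If converting empty strings to NaN is not wanted, one possible alternative is a small helper that counts only non-blank tokens. This is a sketch, not something proposed in the thread: count_values is a hypothetical name, and it assumes that blank tokens (for example after a trailing comma) should not be counted.

```python
def count_values(cell):
    """Count the non-blank comma-separated tokens in a cell.

    Hypothetical helper for illustration; assumes blank tokens are ignored.
    """
    # Missing values such as NaN or None are not strings: treat them as 0.
    if not isinstance(cell, str):
        return 0
    # Split on commas and keep only tokens that contain something
    # other than whitespace.
    return sum(1 for token in cell.split(',') if token.strip())


print(count_values('1,2,3,4'))     # 4
print(count_values('1A, 363C'))    # 2
print(count_values(''))            # 0
print(count_values(float('nan')))  # 0
```

With the DataFrame from the question this could be applied as df['col1'].map(count_values), which leaves the original column untouched.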