Count the number of elements in a list where the list contains the empty string-CodePudding

I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:

import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})

print(df)
   ID          NETWORK
0   1               []
1   2  [OPE, GSR, REP]
2   3            [MER]

Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.

So whatever standard method I try to use to calculate the number of elements within each list I get a wrong value of 1 for those who are supposed to be empty:

df["COUNT"] = df["NETWORK"].str.len()

print(df)
   ID          NETWORK  COUNT
0   1               []      1
1   2  [OPE, GSR, REP]      3
2   3            [MER]      1

I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.

CodePudding user response：

You just need to write a custom apply function that ignores the ''

df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))

CodePudding user response：

Another way:

df['NETWORK'].apply(lambda x: len([y for y in x if y]))

CodePudding user response：

Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.

_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()

This yields:

            NETWORK  count
ID                        
1                []    NaN
2   [OPE, GSR, REP]    3.0
3             [MER]    1.0

Fill NA with zeroes if needed.

CodePudding user response：

df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))

Use a lambda function on each row and in the lambda function return the length of the array