Home > Back-end >  Efficient way to set all empty lists in column to None in Pandas
Efficient way to set all empty lists in column to None in Pandas

Time:11-08

In pandas, it is possible to do the following:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99919 entries, 0 to 99918
Data columns (total 47 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   reason                        99919 non-null  object   

Most of these entries contain [], and some contain an actual object, e.g. [{'a':'x', 'b':'y', 'c':'z'}]

I want to efficiently (without using df.apply) set all entries that contain only the empty list to e.g. None.

I thought that I might be able to do something like the following:

df[df['reasons'].str.len() == 0]['reasons'] = None

but this does not work since it is setting it on a copy of the dataframe (see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)

Usually in the case where I set something on df[row][column], I can instead do:

df.loc[:, (row, column)]

to set it on the actual dataframe, but that does not work in this case since df['reasons'].str.len() == 0 returns a series which is unhashable and not a valid argument for loc.

Is there any way to do this without using apply?

CodePudding user response:

The syntax should be:

df.loc[df['reasons'].str.len() == 0, 'reasons'] = None

The correct use of loc is loc[row, col], not loc[:, (row, col)]

df.loc[:, (X, Y)] can be used if you have a MultiIndex (see below for an example), but this is not the case here.

df = pd.DataFrame(None, index=range(2),
                  columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]]))
df.loc[:, ('A', 1)]

CodePudding user response:

example:

df = pd.DataFrame([[2, []], [1, None], [1,3]], columns=['col1', 'col2'])

output(df)

    col1    col2
0   2       []
1   1       None
2   1       3
  1. when convert to bool dtype, empty list convert to False
  2. where make False to NaN

df.where(df.astype('bool'))

output:

    col1    col2
0   2       NaN
1   1       NaN
2   1       3

if chk only col2 column

df['col2'].where(df['col2'].astype('bool'))
  • Related