How do I select columns only if the rows are in type list?

Time:03-17

My goal is to loop over a DataFrame's columns and select only those whose values are of type list. My table looks like this:

   a             b  c
0  a  ['bb', 'cc']  d
1  z    ['b', 'c']  3

My code looks like this, but does not work.

import pandas as pd

df = pd.DataFrame([['a', ['bb', 'cc'], 'd'], ['z', ['b', 'c'], '3']], columns=['a', 'b', 'c'])

df_list = [col for col in df.columns if type(list) in col]

The desired output is:

              b
0  ['bb', 'cc']
1    ['b', 'c']

CodePudding user response:

df_list = [i for i in df.columns if len(pd.DataFrame(df[i].to_list()).T) > 1]
df[df_list]

Output:

    b
0   [bb, cc]
1   [b, c]

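If you take a column that does not hold lists, call to_list on it, and build a DataFrame from the result, you get an n x 1 DataFrame. A column of lists expands into one column per list element instead.

So we can detect list columns by checking len(df.T) > 1 on that new DataFrame, that is, whether it has more than one column.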
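A minimal sketch of why the check above works, using the DataFrame from the question. It also shows one limitation: a column whose lists all have a single element expands to one column, so this check would miss it. The names shapes, single, and single_shape are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", ["bb", "cc"], "d"], ["z", ["b", "c"], "3"]],
    columns=["a", "b", "c"],
)

# Rebuild each column as its own DataFrame: scalar cells give one column,
# list cells give one column per list element.
shapes = {col: pd.DataFrame(df[col].to_list()).shape for col in df.columns}
print(shapes)

# Limitation: single-element lists also expand to exactly one column,
# so len(...T) > 1 is False for them.
single = pd.DataFrame({"s": [["x"], ["y"]]})
single_shape = pd.DataFrame(single["s"].to_list()).shape
print(single_shape)
```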

CodePudding user response:

There is no built-in pandas way to check this. You need to use pure Python.

If you can rely on testing only the first row, this should be efficient:

mask = df.iloc[0].apply(lambda x: isinstance(x, list))

df.loc[:, mask]
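As a quick illustration of what "relying on the first row" means, here is a small sketch with a made-up DataFrame (mixed is not from the question) in which column x holds a list in the first row only. The first-row test still selects it.

```python
import pandas as pd

# Hypothetical data: column "x" has a list in row 0 but a string in row 1.
mixed = pd.DataFrame({"x": [["p", "q"], "r"], "y": ["s", "t"]})

# Test only the first row, as in the snippet above.
first_row_mask = mixed.iloc[0].apply(lambda v: isinstance(v, list))
selected = mixed.loc[:, first_row_mask].columns.tolist()
print(selected)
```

If columns like x should be excluded, every cell has to be tested rather than just the first row.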

If you need to test all cells, use applymap and all (or any if a single list is sufficient to select a column). Note that this might be slow on large dataframes.

mask = df.applymap(lambda x: isinstance(x, list)).all()

df.loc[:, mask]

Output:

          b
0  [bb, cc]
1    [b, c]
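For reference, DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map. The following sketch avoids both by iterating over each column in plain Python. The helper name list_columns and its how parameter are hypothetical, and it assumes unique column names.

```python
import pandas as pd


def list_columns(frame, how="all"):
    """Return the names of columns whose cells are lists.

    how="all": every cell must be a list (an empty column also passes).
    how="any": at least one cell must be a list.
    """
    reducer = all if how == "all" else any
    return [
        col
        for col in frame.columns
        if reducer(isinstance(value, list) for value in frame[col])
    ]


df = pd.DataFrame(
    [["a", ["bb", "cc"], "d"], ["z", ["b", "c"], "3"]],
    columns=["a", "b", "c"],
)
print(df[list_columns(df)])
```

Calling list_columns(df, how="any") instead selects columns that contain a list in at least one row.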

CodePudding user response:

You didn't specify if all the rows of a specific column in your DataFrame are of the same type. Looking at column c, it seems like they might be of different types. In that case, do you want columns that have ALL rows as a list or columns that have ANY of its rows as a list?

In either case, you can use boolean indexing to filter the DataFrame as follows.

To find out the columns that have a list in all rows:

status = (df.applymap(type).astype(str) == "<class 'list'>").all()

Or, to find out the columns that have a list in any of its rows:

status = (df.applymap(type).astype(str) == "<class 'list'>").any()

Afterwards you can obtain the result by:

target_columns = list(status[status].index)
df = df[target_columns]