Remove certain elements of a list for multiple columns in a DataFrame-CodePudding

I have a table like that:

Column1	Column2	Column3	Column4	Column5
100	John	[-, 1]	[brown, yellow]	[nan, nan]
200	Stefan	[nan, 2]	[nan, yellow]	[-, accepted]

As you can see Columns 3-5 are made of lists entirely and what I want is to remove dash (-) along with "nan" elements from the lists in those columns.

So the output should look like this at the end:

Column1	Column2	Column3	Column4	Column5
100	John	[1]	[brown, yellow]	[]
200	Stefan	[2]	[yellow]	[accepted]

The closest to this outcome I was able to get with the following function:

Table1["Column3"] = Table1["Column3"].apply(lambda x: [el for el in x if el != '-' if pd.isnull(el) == False])

But the problem with it is that I don't know how to apply it for all the columns that are made out of lists in the DataFrame. This is simplified example, in the original I have nearly 15 columns and was wondering if there is a way to achieve it, instead of writing a function like that separately for all 15 columns.

CodePudding user response：

Another solution:

for c in df.columns:
    df[c] = df[c].apply(
        lambda x: [v for v in x if v != "-" and pd.notna(v)]
        if isinstance(x, list)
        else x
    )

print(df)

Prints:

   Column1 Column2 Column3          Column4     Column5
0      100    John     [1]  [brown, yellow]          []
1      200  Stefan     [2]         [yellow]  [accepted]

CodePudding user response：

Try this

# data
df = pd.DataFrame({'Column1': [100, 200],
                   'Column2': ['John', 'Stefan'],
                   'Column3': [['-', 1], [np.nan, 2]],
                   'Column4': [['brown', 'yellow'], [np.nan, 'yellow']],
                   'Column5': [[np.nan, np.nan], ['-', 'accepted']]})

# stack and explode to get the list elements out of lists
exp_df = df.set_index(['Column1', 'Column2']).stack().explode()
# mask that filters out dash and nans
m = exp_df.ne('-') & exp_df.notna()
# after using m, aggregate back to lists
exp_df[m].groupby(level=[0,1,2]).agg(list).unstack(fill_value=[]).reset_index()

CodePudding user response：

If I understand your objectives correctly. Here is how I will approach it.

import pandas as pd
import numpy as np
from typing import Any

### 1. replicate your dataframe. nan here is from np. not sure what nan in your df is. 
df = pd.DataFrame({
    'col_1':[100,200],
    'col_2':['John','Stefan'],
    'col_3':[['-', 1],[np.nan,2]],
    'col_4':[['brown', 'yellow'],[np.nan, 'yellow']]
})

### 2: remove funciton: this function will remove dash and np.nan from each cell for a selected cols once applied
def remove(element: Any) -> Any:
    try: 
        return [x for x in element if x not in [ '-', np.nan]]
    except TypeError: # in case some cell value is not a list
        return element

### 3: detect_col_element_as_list: this function will detect if a given col has any cell composed by list. if so return True
def detect_col_element_as_list(element: pd.Series) -> bool:
    return any(isinstance(x, list) for x in element)

### 4: first get all cols that have cells as list 
cols_contain_list = [col for col in df.columns if detect_col_element_as_list(df[col])]

### 5: a for loop to apply remove function to all cols that has list as cell value
for col in cols_contain_list:
    df[col] = df[col].apply(lambda x: remove(x))

Let me know if this is something you want.