I have a table like that:
Column1 | Column2 | Column3 | Column4 | Column5 |
---|---|---|---|---|
100 | John | [-, 1] | [brown, yellow] | [nan, nan] |
200 | Stefan | [nan, 2] | [nan, yellow] | [-, accepted] |
As you can see Columns 3-5 are made of lists entirely and what I want is to remove dash (-) along with "nan" elements from the lists in those columns.
So the output should look like this at the end:
Column1 | Column2 | Column3 | Column4 | Column5 |
---|---|---|---|---|
100 | John | [1] | [brown, yellow] | [] |
200 | Stefan | [2] | [yellow] | [accepted] |
The closest to this outcome I was able to get with the following function:
Table1["Column3"] = Table1["Column3"].apply(lambda x: [el for el in x if el != '-' if pd.isnull(el) == False])
But the problem with it is that I don't know how to apply it for all the columns that are made out of lists in the DataFrame. This is simplified example, in the original I have nearly 15 columns and was wondering if there is a way to achieve it, instead of writing a function like that separately for all 15 columns.
CodePudding user response:
Another solution:
for c in df.columns:
df[c] = df[c].apply(
lambda x: [v for v in x if v != "-" and pd.notna(v)]
if isinstance(x, list)
else x
)
print(df)
Prints:
Column1 Column2 Column3 Column4 Column5
0 100 John [1] [brown, yellow] []
1 200 Stefan [2] [yellow] [accepted]
CodePudding user response:
Try this
# data
df = pd.DataFrame({'Column1': [100, 200],
'Column2': ['John', 'Stefan'],
'Column3': [['-', 1], [np.nan, 2]],
'Column4': [['brown', 'yellow'], [np.nan, 'yellow']],
'Column5': [[np.nan, np.nan], ['-', 'accepted']]})
# stack and explode to get the list elements out of lists
exp_df = df.set_index(['Column1', 'Column2']).stack().explode()
# mask that filters out dash and nans
m = exp_df.ne('-') & exp_df.notna()
# after using m, aggregate back to lists
exp_df[m].groupby(level=[0,1,2]).agg(list).unstack(fill_value=[]).reset_index()
CodePudding user response:
If I understand your objectives correctly. Here is how I will approach it.
import pandas as pd
import numpy as np
from typing import Any
### 1. replicate your dataframe. nan here is from np. not sure what nan in your df is.
df = pd.DataFrame({
'col_1':[100,200],
'col_2':['John','Stefan'],
'col_3':[['-', 1],[np.nan,2]],
'col_4':[['brown', 'yellow'],[np.nan, 'yellow']]
})
### 2: remove funciton: this function will remove dash and np.nan from each cell for a selected cols once applied
def remove(element: Any) -> Any:
try:
return [x for x in element if x not in [ '-', np.nan]]
except TypeError: # in case some cell value is not a list
return element
### 3: detect_col_element_as_list: this function will detect if a given col has any cell composed by list. if so return True
def detect_col_element_as_list(element: pd.Series) -> bool:
return any(isinstance(x, list) for x in element)
### 4: first get all cols that have cells as list
cols_contain_list = [col for col in df.columns if detect_col_element_as_list(df[col])]
### 5: a for loop to apply remove function to all cols that has list as cell value
for col in cols_contain_list:
df[col] = df[col].apply(lambda x: remove(x))
Let me know if this is something you want.