I have a df with a structure similar to the data below, but with more columns and more rows. How can I get the expected result? My code raises an error - I take it that a list cannot be saved in a df cell? And how can I avoid loops, if possible?
import pandas as pd
import numpy as np

data = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 0, 1], [0, 0, 0]]
df = pd.DataFrame(data, columns=["choice_a", "choice_b", "choice_c"])
Expected result
       choice_a  choice_b  choice_c     choices
index
0             0         0         1       ['c']
1             0         1         0       ['b']
2             1         0         0       ['a']
3             1         0         1  ['a', 'c']
4             0         0         0          NA
My code
df['choices'] = 0
for i in np.arange(df.shape[0]):
    choice_list = []
    for j in np.arange(len(df.columns)):
        if df.iloc[i, j] == 1:
            choice_list.append(df.columns[j].split('_')[1])
    df.iloc[i, 4] = choice_list
Error I am getting
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/65/3mqr9fpn37jf2xt2pxbcgp_w0000gn/T/ipykernel_1513/2279334138.py in <module>
5 if main_dataset.iloc[i,j]==1:
6 choice_list.append(main_dataset.columns[j].split('_')[1])
----> 7 main_dataset.iloc[i,5]=choice_list
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
714
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value, self.name)
717
718 def _validate_key(self, key, axis: int):
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1689 if take_split_path:
1690 # We have to operate column-wise
-> 1691 self._setitem_with_indexer_split_path(indexer, value, name)
1692 else:
1693 self._setitem_single_block(indexer, value, name)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value, name)
1744 return self._setitem_with_indexer((pi, info_axis[0]), value[0])
1745
-> 1746 raise ValueError(
1747 "Must have equal len keys and value "
1748 "when setting with an iterable"
ValueError: Must have equal len keys and value when setting with an iterable
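For context on the error: assigning a list through iloc makes pandas try to spread it element-wise across the selection, which is where the length mismatch comes from. A minimal sketch (toy column names, not the real dataset) showing that a single cell can hold a list when the column has object dtype and DataFrame.at is used instead:

import pandas as pd

demo = pd.DataFrame({"choice_a": [0, 1], "choice_b": [1, 1]})
demo["choices"] = None              # object-dtype column, so a cell can hold a list
demo.at[0, "choices"] = ["b"]       # .at targets exactly one cell, no broadcasting
print(demo)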
CodePudding user response:
Use a list comprehension for the new column, filtering the column names split by _; last, if necessary, replace the empty lists with missing values using Series.where:
cols = df.columns.str.split('_').str[1].to_numpy()
df['choices'] = [list(cols[x == 1]) for x in df.to_numpy()]
df['choices'] = df['choices'].where(df['choices'].astype(bool))
print (df)
   choice_a  choice_b  choice_c choices
0         0         0         1     [c]
1         0         1         0     [b]
2         1         0         0     [a]
3         1         0         1  [a, c]
4         0         0         0     NaN
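The where step works because astype(bool) treats an empty list as False, so only rows with at least one matching column keep their list. A small sketch of that behaviour in isolation, on a toy Series rather than the real data:

import pandas as pd

s = pd.Series([["a", "c"], [], ["b"]])
# Empty lists are falsy, so astype(bool) gives [True, False, True];
# where() keeps the truthy rows and fills the rest with NaN.
print(s.where(s.astype(bool)))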
CodePudding user response:
Another possible solution:
df['choices'] = df.apply(lambda x: df.columns[x == 1].str.replace(
r'^.*_', '', regex=True).values if x.any() else np.nan, axis=1)
Output:
   choice_a  choice_b  choice_c choices
0         0         0         1     [c]
1         0         1         0     [b]
2         1         0         0     [a]
3         1         0         1  [a, c]
4         0         0         0     NaN
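np.nan here assumes import numpy as np. Per row, the lambda boolean-indexes the column labels by the row's values and strips everything up to the underscore; a quick sketch of that selection for one hypothetical row (toy frame, not the real data):

import pandas as pd

demo = pd.DataFrame([[1, 0, 1]], columns=["choice_a", "choice_b", "choice_c"])
row = demo.iloc[0]
# Keep only the labels where the row holds a 1, then drop the "choice_" prefix.
print(demo.columns[row == 1].str.replace(r'^.*_', '', regex=True).values)
# expected: ['a' 'c']

Since apply with axis=1 calls the lambda once per row in Python, the vectorized list comprehension in the previous answer should generally be faster on large frames.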