I have a pandas dataframe with three columns: user_id (str), list_of_purchases (list) and a binary column named b.
I would like to create a fourth column named final_list that follows the rules below:
- When b = 1, then final_list should be the concatenation of list_of_purchases and the item "Success". So for example, if list_of_purchases = ['item_1', 'item_2', 'item_3'] then final_list should be ['item_1', 'item_2', 'item_3','Success']
- When b = 0, then instead of "success", final_list should be ['item_1', 'item_2', 'item_3','Null']
I tried the following code but got the error:
df['final_list'] = np.where(
df['b'] == 0,
df['list_of_purchases'] ['Null'],
df['list_of_purchases'] ['Success'])
TypeError: Cannot broadcast np.ndarray with operand of type <class 'list'>
I figured out how to do it using a for loop and checking every row in column b, but it is really ineficient and takes a long time.
Thanks in advance for the help!
CodePudding user response:
#create a function:
def lista(df):
return [df['list_of_purchases'] ['Null'] if df['b'] == 0 else df['list_of_purchases'] ['Success']]
#use the function on every row of df:
df['final_list'] = df.apply(lista, axis=1)
from what I understand, pandas dataframes are not designed to store lists as their values, so super efficient solutions are not available
CodePudding user response:
Feels like a nice use for a lambda, rather than defining a function, though both approaches work.
import pandas as pd
import numpy as np
data = [[1, [1,2,3]],
[0, [4,5,6]]]
df = pd.DataFrame(data, columns=["b", "list_of_purchases"])
df["Output"] = df.apply( \
lambda row : row["list_of_purchases"] \
["Success" if row["b"] else "Null"], axis=1)
print(df)
Produces:
b list_of_purchases Output
0 1 [1, 2, 3] [1, 2, 3, Success]
1 0 [4, 5, 6] [4, 5, 6, Null]
"Benefits" of using a lambda are (here) basically just that it avoids defining a function which you might not ever reuse.
If the function/logic is to be reused elsewhere then defining a function (and not using a lambda) might be exactly the right approach. If it's only used for this, and not reused, I'd probably go with the lambda.