I am wondering why the pandas assign function cannot handle returned lists.
For example:
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "val": [10, 20, 30, 30, 40]
})

def squareMe(x):
    return x**2

df = df.assign(val2 = lambda x: squareMe(x.val))
# Out > Works fine: returns a DataFrame with squared values
But if we return a list,
def squareMe(x):
    return [x**2]

df = df.assign(val2 = lambda x: squareMe(x.val))
# Out > ValueError: Length of values (1) does not match length of index (5)
However, the pandas apply function works fine when returning a list:
def squareMe(x):
    return [x**2]

df["val2"] = df.val.apply(lambda x: squareMe(x))
Is there a particular reason for this, or am I doing something wrong?
CodePudding user response:
Since you reference x.val in the call to squareMe, that function is passed a Series, not a single number (you can easily verify this by adding a debug statement that prints type(x) inside the function).
Thus, x ** 2 returns a Series (since the expression is vectorized) and the assignment works correctly.
But when you return [x ** 2], you're returning the Series inside a list. That doesn't make sense to assign: all assign sees is an iterable of length 1 (the Series inside it), and it deems this to be the incorrect length for a column assignment to a DataFrame with 5 rows. That is exactly what ValueError: Length of values (1) does not match length of index (5) means.
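To see this for yourself, here is a minimal sketch of the debug check mentioned above. The names square_me, seen_types and wrapped are just illustrative, and the DataFrame is the one from the question:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

seen_types = []  # illustrative helper: records what the function receives

def square_me(x):
    seen_types.append(type(x).__name__)
    return x ** 2  # vectorized: squares the whole column at once

result = df.assign(val2=lambda frame: square_me(frame.val))

# Wrapping the squared column in a list gives a list with ONE element
wrapped = [df.val ** 2]

print(seen_types)             # the function received a Series
print(len(wrapped), len(df))  # 1 value versus 5 index entries
```

The length mismatch in the last line is the same 1 versus 5 that the error message reports.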
The difference with apply is that the function receives a single number, not a Series. So you still return a single item per row (a list), which apply accepts, but it is still not quite right, since you shouldn't need to wrap the result in a list.
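A small sketch of what apply hands to the function, assuming the same values as the question's val column (got_series and out are illustrative names):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 30, 40], name="val")

got_series = []  # illustrative helper: was the argument a Series?

def square_me(x):
    got_series.append(isinstance(x, pd.Series))
    return [x ** 2]  # one single-element list per row

out = s.apply(square_me)

print(any(got_series))  # apply never passed a Series, only scalars
print(out.tolist())     # each cell holds a one-element list
```

So apply calls the function once per element, and each call returns exactly one object (a list) for that row, which is why the lengths line up.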
More information: df.assign, df.apply
P.S.: you probably already understand this, but you can simplify this to df['val2'] = df['val'] ** 2
CodePudding user response:
assign isn't particularly meant for this. It is for assigning columns from values that already have the right shape, whether they are passed directly or returned by a callable given as the argument.
Docs:
Parameters: **kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
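As a rough illustration of the three kinds of values the docs describe, here is a sketch using the question's DataFrame. The column names from_callable, from_series and from_scalar are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

out = df.assign(
    from_callable=lambda frame: frame.val ** 2,  # computed on the DataFrame
    from_series=df.val + 1,                      # a Series, assigned as-is
    from_scalar=0,                               # a scalar, broadcast to every row
)

print(out.columns.tolist())
print(df.columns.tolist())  # the original frame is left unchanged
```

In each case the value either has one entry per row or is a scalar that can be broadcast, which is what a list of length 1 on a 5-row frame fails to be.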
Doing [x ** 2] returns a list containing a single Series, which is a list-like of length 1, and therefore, as the error mentions:
ValueError: Length of values (1) does not match length of index (5)
The length of the values doesn't match the length of the index.
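If a column of one-element lists really is what you want from assign, one possible approach (a sketch, not the only way) is to build the lists per element inside the callable so that it returns one value per row. The names error_message and with_lists are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

# Reproduce the failure: the callable returns a list of length 1
try:
    df.assign(val2=lambda frame: [frame.val ** 2])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Return one list per row instead, so the lengths match the index
with_lists = df.assign(val2=lambda frame: frame.val.apply(lambda v: [v ** 2]))

print(error_message)
print(with_lists["val2"].tolist())
```

Here the callable returns a Series of 5 lists rather than 1 list holding a Series, so assign accepts it.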