I'm trying to create a new column on my DataFrame by grouping two existing columns:
import pandas as pd
import numpy as np
DATA=pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
DATA['index']=np.arange(5)
DATA.set_index('index', inplace=True)
The output is something like this:
            'A'       'B'
index
0     -0.003635 -0.644897
1     -0.617104 -0.343998
2      1.270503 -0.514588
3     -0.053097 -0.404073
4     -0.056717  1.870671
I would like to have an extra column 'C' that has an np.array with the elements of 'A' and 'B' for the corresponding row. In the real case, 'A' and 'B' are already 1D np.arrays, but of different lengths. I would like to make a longer array with all the elements stacked or concatenated.
Thanks
CodePudding user response:
If columns a and b contain numpy arrays, you could apply hstack across rows:
import pandas as pd
import numpy as np
num_rows = 10
max_arr_size = 3
df = pd.DataFrame({
"a": [np.random.rand(max_arr_size) for _ in range(num_rows)],
"b": [np.random.rand(max_arr_size) for _ in range(num_rows)],
})
df["c"] = df.apply(np.hstack, 1)
assert all(row.a.size + row.b.size == row.c.size for _, row in df.iterrows())
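For the case in the question where the arrays have different lengths, a minimal sketch may help. The frame df_ragged and its values are made up for illustration, and the new column is built with np.concatenate in a list comprehension rather than with apply, which is one possible alternative and not the only way to do it:

```python
import numpy as np
import pandas as pd

# Hypothetical data: every cell holds a 1D array, and lengths differ per row and column.
df_ragged = pd.DataFrame({
    "a": [np.arange(2), np.arange(3), np.arange(1)],
    "b": [np.arange(4), np.arange(1), np.arange(2)],
})

# Join the two arrays of each row into one longer 1D array.
df_ragged["c"] = [
    np.concatenate([a, b]) for a, b in zip(df_ragged["a"], df_ragged["b"])
]

# Each combined array should be as long as its two parts together.
assert all(
    len(c) == len(a) + len(b)
    for a, b, c in zip(df_ragged["a"], df_ragged["b"], df_ragged["c"])
)
```

The resulting column has object dtype, since each cell stores its own array.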
CodePudding user response:
DATA['C'] = DATA.apply(lambda x: np.array([x.A, x.B]), axis=1)
pandas requires every column of a DataFrame to have the same number of rows, so 'A' and 'B' always line up row by row and uneven Series lengths shouldn't be a problem.
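As a quick check of the line above on scalar columns like those in the question, the sketch below rebuilds a similar frame and confirms that each cell of 'C' is a length-2 array. It assumes 'A' and 'B' hold plain floats. If they hold 1D arrays of different lengths, np.array([x.A, x.B]) would not produce one flat array, and np.concatenate([x.A, x.B]) is probably the closer fit.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (random values).
DATA = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Pack the two scalars of each row into a small array.
DATA["C"] = DATA.apply(lambda x: np.array([x.A, x.B]), axis=1)

# Every cell of 'C' should be a 1D array holding that row's A and B.
assert all(c.shape == (2,) for c in DATA["C"])
```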