How to create a new dataframe and add new variables in Python?

I created two random variables (x and y) with certain properties. Now, I want to create a dataframe from scratch out of these two variables. Unfortunately, what I type seems to be wrong. How can I do this correctly?

import pandas as pd
from scipy.stats import bernoulli, norm

# creating variable x with Bernoulli distribution
x = bernoulli.rvs(size=100, p=0.6)

# form a column vector (n, 1)
x = x.reshape(-1, 1)
print(x)

# creating variable y with normal distribution
y = norm.rvs(size=100,loc=0,scale=1)

# form a column vector (n, 1)
y = y.reshape(-1, 1)
print(y)

# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()  
df.assign(y = y,  x = x)
df

Answer:

There are a lot of ways to go about this.

According to the documentation, pd.DataFrame accepts an ndarray (structured or homogeneous), Iterable, dict, or DataFrame. Your issue is that x and y are each a 2D NumPy array:

>>> x.shape
(100, 1)

whereas the constructor expects either one 1D array per column or a single 2D array.
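
To make the shape difference concrete, here is a minimal sketch. It assumes NumPy's random Generator as a stand-in for scipy.stats.bernoulli (a Bernoulli draw is a binomial draw with n=1); the seed and the name x_flat are illustrative choices, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary, used only for reproducibility

# Bernoulli(p=0.6) draws, reshaped into a column vector as in the question
x = rng.binomial(n=1, p=0.6, size=100).reshape(-1, 1)
print(x.ndim, x.shape)            # 2 (100, 1)

# ravel() collapses the column vector back to one dimension,
# which is the per-column shape the DataFrame constructor expects
x_flat = x.ravel()
print(x_flat.ndim, x_flat.shape)  # 1 (100,)
```

If the column-vector form is not needed elsewhere, simply skipping the reshape step leaves x one-dimensional from the start.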

One way is to stack the two arrays into a single 2D array before calling the DataFrame constructor:

>>> pd.DataFrame(np.hstack([x,y]))
      0         1
0   0.0  0.764109
1   1.0  0.204747
2   1.0 -0.706516
3   1.0 -1.359307
4   1.0  0.789217
..  ...       ...
95  1.0  0.227911
96  0.0 -0.238646
97  0.0 -1.468681
98  0.0  1.202132
99  0.0  0.348248
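
The stacked version loses the column names and turns the 0/1 values into floats, because a NumPy array holds a single dtype. A possible sketch of how to address both, assuming the column names "x" and "y" and NumPy's Generator in place of scipy.stats (both are illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.binomial(n=1, p=0.6, size=100).reshape(-1, 1)   # integer 0/1 column vector
y = rng.normal(loc=0, scale=1, size=100).reshape(-1, 1)  # float column vector

# hstack places the two (100, 1) arrays side by side, giving shape (100, 2);
# the integer column is upcast to float so the array has one dtype
stacked = np.hstack([x, y])

# name the columns explicitly instead of accepting the default 0 and 1
df = pd.DataFrame(stacked, columns=["x", "y"])

# convert x back to integers after construction
df["x"] = df["x"].astype(int)
print(df.dtypes)
```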

The alternatives mostly revolve around calling np.ndarray.flatten(), e.g. to construct a dict:

>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
    x         y
0   0  0.764109
1   1  0.204747
2   1 -0.706516
3   1 -1.359307
4   1  0.789217
.. ..       ...
95  1  0.227911
96  0 -0.238646
97  0 -1.468681
98  0  1.202132
99  0  0.348248
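
If you would rather keep the original approach of starting from an empty DataFrame, note that DataFrame.assign returns a new DataFrame and leaves the one it was called on unchanged, so the result has to be stored. A minimal sketch, again assuming NumPy's Generator in place of scipy.stats and using the hypothetical name empty for the starting frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.binomial(n=1, p=0.6, size=100).reshape(-1, 1)
y = rng.normal(loc=0, scale=1, size=100).reshape(-1, 1)

empty = pd.DataFrame()

# assign does not modify `empty` in place; keep the returned DataFrame.
# ravel() turns each (100, 1) column vector into a 1D array of length 100.
df = empty.assign(y=y.ravel(), x=x.ravel())

print(empty.shape)  # (0, 0)
print(df.shape)     # (100, 2)
```

Unlike the stacked version, each column keeps its own dtype here, so x stays an integer column.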