I created two random variables (x and y) with certain properties. Now, I want to create a dataframe from scratch out of these two variables. Unfortunately, what I type seems to be wrong. How can I do this correctly?
# creating variable x with Bernoulli distribution
import pandas as pd
from scipy.stats import bernoulli, norm
x = bernoulli.rvs(size=100, p=0.6)
# form a column vector (n, 1)
x = x.reshape(-1, 1)
print(x)
# creating variable y with normal distribution
y = norm.rvs(size=100, loc=0, scale=1)
# form a column vector (n, 1)
y = y.reshape(-1, 1)
print(y)
# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()
df.assign(y = y, x = x)
df
CodePudding user response:
There are many ways to go about this. According to the documentation, pd.DataFrame accepts an ndarray (structured or homogeneous), Iterable, dict, or DataFrame. Your issue is that x and y are 2D numpy arrays:
>>> x.shape
(100, 1)
where it expects either one 1d array per column or a single 2d array.
One way would be to stack the arrays into one before calling the DataFrame constructor:
>>> pd.DataFrame(np.hstack([x,y]))
      0          1
0   0.0   0.764109
1   1.0   0.204747
2   1.0  -0.706516
3   1.0  -1.359307
4   1.0   0.789217
..  ...        ...
95  1.0   0.227911
96  0.0  -0.238646
97  0.0  -1.468681
98  0.0   1.202132
99  0.0   0.348248
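Note that the stacked version has integer column labels and that x shows up as floats, because np.hstack produces a single array with one common dtype. A minimal sketch of the same approach with named columns, assuming x and y are built as in the question (the name stacked_df is only illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import bernoulli, norm

# Column vectors of shape (100, 1), as in the question
x = bernoulli.rvs(size=100, p=0.6).reshape(-1, 1)
y = norm.rvs(size=100, loc=0, scale=1).reshape(-1, 1)

# hstack joins the two (100, 1) arrays into one (100, 2) array;
# columns= gives the resulting columns readable names
stacked_df = pd.DataFrame(np.hstack([x, y]), columns=["x", "y"])

print(stacked_df.shape)
print(stacked_df.dtypes)  # both columns are float after stacking
```

If x should stay an integer column, the dict-based construction is probably the better fit, since each column keeps its own dtype there.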
The alternatives mostly revolve around calling np.ndarray.flatten(), e.g. to construct a dict:
>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
     x          y
0    0   0.764109
1    1   0.204747
2    1  -0.706516
3    1  -1.359307
4    1   0.789217
..  ..        ...
95   1   0.227911
96   0  -0.238646
97   0  -1.468681
98   0   1.202132
99   0   0.348248
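If you would rather keep the df.assign approach from the question, something along these lines should work. This is only a sketch: it assumes the same x and y as above, uses ravel() to get 1D arrays (flatten() would do as well), and relies on the fact that assign returns a new DataFrame instead of changing df in place, so the result has to be stored:

```python
import pandas as pd
from scipy.stats import bernoulli, norm

# Column vectors of shape (100, 1), as in the question
x = bernoulli.rvs(size=100, p=0.6).reshape(-1, 1)
y = norm.rvs(size=100, loc=0, scale=1).reshape(-1, 1)

df = pd.DataFrame()
# ravel() turns each (100, 1) array into a 1D array of length 100;
# assign() returns a new DataFrame, so rebind the name to keep the result
df = df.assign(y=y.ravel(), x=x.ravel())

print(df.head())
```

The columns appear in the order the keyword arguments are given, so here y comes before x.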