I have run into a problem with numpy arrays. I used CountVectorizer from sklearn with a wordset and values (from a pandas column) to create an array of arrays that count words (BoW). When I print the array and its shape, I get this result:
[[array([0, 5, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
...
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]] (2800, 1)
An array of arrays with the shape of a column vector?
I checked that all rows have the same size.
Here is a way to reproduce my problem:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data = pd.DataFrame(["blop blip blup", "bop bip bup", "boop boip boup"], columns=["corpus"])
# add labels column
data["label"] = ["blop", "bip", "boup"]
wordset = pd.Series([y for x in data["corpus"].str.split() for y in x]).unique()
cvec = CountVectorizer(vocabulary=wordset, ngram_range=(1, 2))
labels_count_np = data["label"].apply(lambda x: cvec.fit_transform([x]).toarray()[0]).values
print(labels_count_np, labels_count_np.shape)
It should print:
[array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 1])] (3,)
Can someone explain why numpy behaves this way?
Also, I tried to find a way to concatenate multiple arrays like this:
A = [array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [array([0, 7, 2, 0]) array([1, 4, 0, 8])
array([6, 1, 0, 9])]
concatenate(A,B) =>
[
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9]
]
But I have not found a good way to do it.
CodePudding user response:
You can concatenate row by row using a list comprehension:
C = [np.append(x, B[i]) for i, x in enumerate(A)]
OUTPUT
[array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0]),
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8]),
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9])]
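If a single 2d numeric array is wanted rather than a list of 1d arrays, one possible alternative is to stack each collection into a 2d array first and then join them column-wise. This is only a sketch: it assumes every row in A has the same length and every row in B has the same length, and it rebuilds A and B from the values shown in the question.

```python
import numpy as np

# A and B rebuilt from the question; each is a sequence of equal-length 1d arrays.
A = [np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]),
     np.array([0, 0, 0, 0, 1, 0, 0, 0, 0]),
     np.array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [np.array([0, 7, 2, 0]),
     np.array([1, 4, 0, 8]),
     np.array([6, 1, 0, 9])]

# np.stack turns each sequence of 1d arrays into a 2d array (one row per element),
# np.hstack then joins the two 2d arrays along the column axis.
C = np.hstack([np.stack(A), np.stack(B)])

print(C.shape)
print(C)
```

The same calls should also work when A and B are 1d object arrays of arrays (such as the values of a Series), since np.stack simply iterates over the elements.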
CodePudding user response:
values from a dataframe, even if it has just one column, will be 2d. values from a Series, one column of the frame, will be 1d.
If labels_count_np is (2800, 1) shape, you can easily make it 1d with labels_count_np[:,0] or np.squeeze(labels...). That's just basic numpy.
It will still be an object dtype array containing arrays, the elements of the dataframe cells. If those arrays are all the same size then np.stack(labels_count_np[:,0]) should create a 2d numeric array.
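As a rough illustration of those two steps, here is a small numpy-only sketch. The array obj is a hypothetical stand-in for labels_count_np: a (3, 1) object array whose cells each hold a 1d integer array of the same length.

```python
import numpy as np

# Hypothetical stand-in for labels_count_np: a (3, 1) object array of 1d arrays.
obj = np.empty((3, 1), dtype=object)
for i in range(3):
    obj[i, 0] = np.zeros(4, dtype=int)

# Drop the length-1 column axis; both forms give a 1d object array of shape (3,).
flat = obj[:, 0]
squeezed = np.squeeze(obj, axis=1)

# Stack the inner arrays into one 2d numeric array (needs equal-length rows).
stacked = np.stack(flat)

print(obj.shape, flat.shape, stacked.shape, stacked.dtype)
```

Note that flat is still object dtype; only the np.stack step produces a regular numeric array.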
Make a frame with array elements:
In [35]: df = pd.DataFrame([None,None,None], columns=['x'])
In [36]: df
Out[36]:
      x
0  None
1  None
2  None
In [37]: for i in range(3):df['x'][i] = np.zeros(4,int)
In [38]: df
Out[38]:
              x
0  [0, 0, 0, 0]
1  [0, 0, 0, 0]
2  [0, 0, 0, 0]
The 2d array from the frame:
In [39]: df.values
Out[39]:
array([[array([0, 0, 0, 0])],
       [array([0, 0, 0, 0])],
       [array([0, 0, 0, 0])]], dtype=object)
In [40]: _.shape
Out[40]: (3, 1)
The 1d array from the Series:
In [41]: df['x'].values
Out[41]:
array([array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0])],
      dtype=object)
In [42]: _.shape
Out[42]: (3,)
Joining the Series values into one 2d array:
In [43]: np.stack(df['x'].values)
Out[43]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])