Home > other >  Split Numpy array (800 * x) where each row is a string of space separeted feature values
Split Numpy array (800 * x) where each row is a string of space separeted feature values

Time:03-17

I have a NumPy array of the form (3*1):

['1 2 3', '4 5 6 7 8 9', '10 11 12 13']. 

Each row is a collection of feature values and each row has a different number of features. How can I split this array into multiple columns with each column representing a feature for all rows?

Update: I am now using on DataFrames. So, I have a data frame with one column. Want to expand it and also convert it to integers. How can I do that?

enter image description here

Expectation: I want to split data in each row into multiple columns (features). Then, each column to numerics.

CodePudding user response:

You can use expand=True in str.split like below:

df = pd.DataFrame({'features':['191 367 614 634 711',
                               '1202 1220 131 1730 2281 2572 2602 2611 2824', 
                               '2855 2940 3149 3313 3560 3568 3824 4185 4266']})

df.features.str.split(expand=True).astype(float).add_prefix('feature_')

Output:

   feature_0  feature_1  feature_2  ...  feature_6  feature_7  feature_8
0      191.0      367.0      614.0  ...        NaN        NaN        NaN
1     1202.0     1220.0      131.0  ...     2602.0     2611.0     2824.0
2     2855.0     2940.0     3149.0  ...     3824.0     4185.0     4266.0

CodePudding user response:

Since numpy doesn't support non-rectangular arrays, it would be easiest to use pandas for this:

df = pd.concat([pd.Series(l) for l in pd.Series(a).str.split(' ').tolist()], axis=1).T.add_prefix('feature_')

Output:

>>> df
  feature_0 feature_1 feature_2 feature_3 feature_4 feature_5
0         1         2         3       NaN       NaN       NaN
1         4         5         6         7         8         9
2        10        11        12        13       NaN       NaN

CodePudding user response:

You show a list (but the following also works if it's an array of strings):

In [44]: alist = ['1 2 3', '4 5 6 7 8 9', '10 11 12 13']
In [45]: alist
Out[45]: ['1 2 3', '4 5 6 7 8 9', '10 11 12 13']

Split each string into a list:

In [46]: blist = [s.split() for s in alist]
In [47]: blist
Out[47]: [['1', '2', '3'], ['4', '5', '6', '7', '8', '9'], ['10', '11', '12', '13']]

convert each number string to int:

In [48]: clist = [[int(x) for x in s] for s in blist]
In [49]: clist
Out[49]: [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10, 11, 12, 13]]

Since the sublists differ in length, I'm going to refrain from processing this as arrays. If you specify how such a list of lists maps on to your notion of features, we can proceed.

  • Related