How to flatten a column of arrays in DataFrame, apply a function, and restore the structure?

Time:12-14

I have a DataFrame that looks as follows. There are two columns, the second of which contains numpy arrays that differ in shape (here: (2, 1), (2, 2), (2, 3)). Example:

   class                         data
0      0                  [[3], [17]]
1      1            [[9, 5], [8, 19]]
2      1  [[8, 16, 13], [17, 19, 10]]

I would now like to flatten the data column to get a 1D array [3, 17, 9, 5, 8, 19, 8, 16, 13, 17, 19, 10], apply a function to this vector, and restore the original shape of the DataFrame. For example, if I want to subtract the mean of the vector from all elements, the desired output is this:

   class                      data
0      0               [[-9], [5]]
1      1       [[-3, -7], [-4, 7]]
2      1  [[-4, 4, 1], [5, 7, -2]]

How can I best achieve this transformation?

Edit for @mozway:

I generated the DataFrame like this:

import numpy as np
import pandas as pd

data = []

np.random.seed(8)

# Three integer arrays with shapes (2, 1), (2, 2), (2, 3)
for i in range(1, 4):
    data.append(np.random.randint(0, 20, (2, i)))

category = {"class": [0, 1, 1]}
df = pd.DataFrame(category)
df["data"] = data

A function to transform the 1D array mentioned before would be arr -= np.mean(arr).

Answer:

Assuming flat is the already transformed 1D array, as below:

[-9.  5. -3. -7. -4.  7. -4.  4.  1.  5.  7. -2.]

One approach could be the following:

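from collections.abc import Iterable


def nested_unflatten(da, placeholder):
    # da: iterator over the flat values
    # placeholder: nested structure whose shape is copied
    res = []
    for e in placeholder:
        if isinstance(e, Iterable):
            # Recurse into nested levels; the shared iterator keeps its position
            res.append(nested_unflatten(da, e))
        else:
            # Leaf position: take the next flat value
            res.append(next(da))
    return res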


flat = np.array([-9.,  5., -3., -7., -4.,  7., -4.,  4.,  1.,  5.,  7., -2.])
un_flat = nested_unflatten(iter(flat), df["data"])
print(un_flat)

Output

[[[-9.0], [5.0]], [[-3.0, -7.0], [-4.0, 7.0]], [[-4.0, 4.0, 1.0], [5.0, 7.0, -2.0]]]
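
Note that this result is a nested list, not a column of NumPy arrays. A minimal sketch of writing it back into the DataFrame follows, assuming each cell should hold an array again. The DataFrame here is rebuilt by hand from the values shown in the question, and un_flat is copied from the output above.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt by hand from the values shown in the question.
df = pd.DataFrame({"class": [0, 1, 1]})
df["data"] = [
    np.array([[3], [17]]),
    np.array([[9, 5], [8, 19]]),
    np.array([[8, 16, 13], [17, 19, 10]]),
]

# Nested-list result, copied from the output above.
un_flat = [
    [[-9.0], [5.0]],
    [[-3.0, -7.0], [-4.0, 7.0]],
    [[-4.0, 4.0, 1.0], [5.0, 7.0, -2.0]],
]

# Turn each top-level element back into an array, one array per row.
df["data"] = [np.asarray(block) for block in un_flat]

# The per-row shapes match the original column again.
restored_shapes = [a.shape for a in df["data"]]
print(restored_shapes)
```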

If you are also interested in a flatten function:

def flatten(da):
    res = []
    for e in da:
        if isinstance(e, Iterable):
            res.extend(flatten(e))
        else:
            res.append(e)
    return res

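It can be used to obtain flat from the example. The cast to float64 matters, because subtracting a float mean in place from an integer array raises a casting error:

flat = np.array(flatten(df["data"]), dtype=np.float64)
flat -= np.mean(flat)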
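
As an alternative to the recursive helpers, the whole round trip can probably be done with NumPy alone. This is a sketch, not the approach from the answer above: it assumes every cell is a NumPy array, records each array's shape and size, and uses np.concatenate, np.split and reshape to flatten and restore. The DataFrame is again rebuilt by hand from the question's example values.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt by hand from the values shown in the question.
df = pd.DataFrame({"class": [0, 1, 1]})
df["data"] = [
    np.array([[3], [17]]),
    np.array([[9, 5], [8, 19]]),
    np.array([[8, 16, 13], [17, 19, 10]]),
]

arrays = list(df["data"])
shapes = [a.shape for a in arrays]  # remembered for the restore step
sizes = [a.size for a in arrays]    # number of elements per row

# Flatten every array and join them into one float vector.
flat = np.concatenate([a.ravel() for a in arrays]).astype(np.float64)

# Apply the function from the question: subtract the overall mean.
flat -= flat.mean()

# Cut the vector back into per-row chunks, then restore each shape.
split_points = np.cumsum(sizes)[:-1]
chunks = np.split(flat, split_points)
df["data"] = [chunk.reshape(shape) for chunk, shape in zip(chunks, shapes)]

print(df)
```

Recording the shapes before flattening is what makes the restore step possible, since the flat vector itself carries no structure.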