cleaning dataframe columns with single cell arrays of different types

Time:05-11

I am working with a large dataframe that has many columns. Some of the columns hold their data as nested single-value arrays (for example [[10]] or [50]) rather than plain scalars. I need to convert those columns so that each cell holds just the value, without the surrounding array. I have tried flatten and squeeze in different ways, but could not get the output I am looking for. The following code reproduces the data format I am working with:

import pandas as pd
a = [[[10]],[[20]],[[30]],[[40]]]
b=[[50],[60],[70],[80]]
c=[90,100,110,120]
df = pd.DataFrame(list(zip(a,b,c)),columns=['a','b','c'])
print(df)

The output of the above is:

        a     b    c
0  [[10]]  [50]   90
1  [[20]]  [60]  100
2  [[30]]  [70]  110
3  [[40]]  [80]  120

However, I am looking to get the output as below:

    a   b    c
0  10  50   90
1  20  60  100
2  30  70  110
3  40  80  120

It would really help if you could suggest how to approach this problem.

CodePudding user response:

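Maybe not the best solution, but it works.

import numpy as np

def ravel_series(s):
    try:
        # Join the nested single-value arrays of a column and flatten to 1-D
        return np.concatenate(s).ravel()
    except ValueError:
        # A column of plain scalars cannot be concatenated; keep it unchanged
        return s

df.apply(ravel_series)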

CodePudding user response:

You can try this,

Code:

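def clean(el):
  # Doubly nested list such as [[10]]: take the inner value
  if isinstance(el, list) and any(isinstance(i, list) for i in el):
    return el[0][0]
  # Singly nested list such as [50]: take its only value
  elif isinstance(el, list):
    return el[0]
  # Already a scalar: return it unchanged
  return el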

df['a'] = df.a.apply(clean)
df['b'] = df.b.apply(clean)

print(df)

Output:

    a   b    c
0  10  50   90
1  20  60  100
2  30  70  110
3  40  80  120

CodePudding user response:

You can unnest the list with the str accessor:

df['a'].str[0].str[0]

output:

0    10
1    20
2    30
3    40
Name: a, dtype: int64
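
Each .str[0] strips one level of nesting, which is why the call is chained twice for column a. A minimal sketch of the intermediate step, using a small hypothetical Series (the names s, once and twice are illustrative only and not part of the original answer):

```python
import pandas as pd

# Hypothetical sample mirroring column 'a' from the question
s = pd.Series([[[10]], [[20]], [[30]]], name="a")

once = s.str[0]      # each cell is now a one-element list such as [10]
twice = once.str[0]  # each cell is now the bare value

print(once.tolist())
print(twice.tolist())
```

A column of plain integers such as c has no .str accessor, so this only applies to columns that still hold lists.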

To automate things a bit, you can use a recursive function:

def unnest(x):
    from pandas.api.types import is_numeric_dtype
    if is_numeric_dtype(x):
        return x
    else:
        return unnest(x.str[0])

df2 = df.apply(unnest)

A variant that uses the first item of each Series to determine the nesting level:

def unnest(x):
    if len(x)>0 and isinstance(x.iloc[0], list):
        return unnest(x.str[0])
    else:
        return x

df2 = df.apply(unnest)

output:

    a   b    c
0  10  50   90
1  20  60  100
2  30  70  110
3  40  80  120

arbitrary nesting

If the nesting depth varies from cell to cell, you can apply the same logic per element:

def unnest(x):
    if isinstance(x, list) and len(x)>0:
        return unnest(x[0])
    else:
        return x
    
df2 = df.applymap(unnest)
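
As a self-contained check of the per-element approach, the sketch below uses an iterative variant of the same logic on a hypothetical frame whose column a mixes nesting depths within one column. The sample values and the names elementwise and flat are assumptions for illustration. The hasattr test is only one possible way to handle both pandas versions with DataFrame.map (2.1 and later) and older ones that only offer applymap.

```python
import pandas as pd

def unnest(x):
    # Follow the first element down until reaching a non-list (or an empty list)
    while isinstance(x, list) and len(x) > 0:
        x = x[0]
    return x

# Hypothetical data: column 'a' mixes several nesting depths
a = [[[10]], [20], 30, [[[[40]]]]]
b = [[50], [60], [70], [80]]
c = [90, 100, 110, 120]
df = pd.DataFrame(list(zip(a, b, c)), columns=['a', 'b', 'c'])

# Prefer DataFrame.map where it exists, fall back to applymap otherwise
elementwise = df.map if hasattr(df, "map") else df.applymap
flat = elementwise(unnest)

print(flat)
```

Note that an empty list is returned unchanged, since it has no first element to descend into.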