Handle missing data when flattening nested array field in pandas dataframe-CodePudding

We need to flatten this into a standard 2D DataFrame:

arr = [
    [{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }}, 
     { 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
    [{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }}, 
     { 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })

Here's our code, which is working just fine for this dummy example

for i in range(len(mydf)):
    output_list = []
    for i in range(len(mydf)):
        team1 = mydf['arr'][i][0]
        team2 = mydf['arr'][i][1]
        zed = { 't1': team1['abbr'], 't2': team2['abbr'] }
        output_list.append(zed)

    output_df = pd.DataFrame(output_list)
    
final_df = pd.concat([mydf, output_df], axis=1)
final_df.pop('arr')
final_df

    name    t1  t2
0   nick    ORL ATL
1   tom     NYK BOS

Our source of data is not reliable and ma have missing values, and our code seems fraught with structural weaknesses. In particular, errors are thrown when either of these are the raw data (missing field, missing dict):

# missing dict
arr = [
    [{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }}],
    [{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }}, 
     { 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })

# missing "abbr" field
arr = [
    [{ 'id': 3, 'record': { 'win': 3, 'loss': 7 }}, 
     { 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
    [{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }}, 
     { 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })

Is it possible to (a) replace the for-loop with a more structurally sound approach (apply), and (b) handle the missing data concerns?

CodePudding user response：

The main issue with your code is that "abbr" key may not exist. You could account for that using dict.get method. If you replace:

zed = { 't1': team1['abbr'], 't2': team2['abbr'] }

with

zed = { 't1': team1.get('abbr', np.nan), 't2': team2.get('abbr', np.nan) }

it will work as expected.

An alternative approach that doesn't use explicit loop:

You could explode and str.get the abbr field; convert it to a list; build a DataFrame with it and join it back to df:

df = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
out = (df.join(pd.DataFrame((df['arr']
                             .explode()
                             .str.get('abbr')
                             .groupby(level=0)
                             .agg(list)
                             .tolist()), 
                           columns=['t1','t2'], index=df.index))
       .drop(columns='arr')
       .fillna(np.nan))

For the sample that works in your code:

name   t1   t2
0  nick  ORL  ATL
1   tom  NYK  BOS

For the first sample that doesn't work:

   name   t1   t2
0  nick  ORL  NaN
1   tom  NYK  BOS

For the second sample that doesn't work:

   name   t1   t2
0  nick  NaN  ATL
1   tom  NYK  BOS