We need to flatten this into a standard 2D DataFrame:
arr = [
[{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
Here's our code, which is working just fine for this dummy example
for i in range(len(mydf)):
output_list = []
for i in range(len(mydf)):
team1 = mydf['arr'][i][0]
team2 = mydf['arr'][i][1]
zed = { 't1': team1['abbr'], 't2': team2['abbr'] }
output_list.append(zed)
output_df = pd.DataFrame(output_list)
final_df = pd.concat([mydf, output_df], axis=1)
final_df.pop('arr')
final_df
name t1 t2
0 nick ORL ATL
1 tom NYK BOS
Our source of data is not reliable and ma have missing values, and our code seems fraught with structural weaknesses. In particular, errors are thrown when either of these are the raw data (missing field, missing dict):
# missing dict
arr = [
[{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
# missing "abbr" field
arr = [
[{ 'id': 3, 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
Is it possible to (a) replace the for-loop with a more structurally sound approach (apply), and (b) handle the missing data concerns?
CodePudding user response:
The main issue with your code is that "abbr" key may not exist. You could account for that using dict.get
method. If you replace:
zed = { 't1': team1['abbr'], 't2': team2['abbr'] }
with
zed = { 't1': team1.get('abbr', np.nan), 't2': team2.get('abbr', np.nan) }
it will work as expected.
An alternative approach that doesn't use explicit loop:
You could explode
and str.get
the abbr
field; convert it to a list; build a DataFrame with it and join
it back to df
:
df = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
out = (df.join(pd.DataFrame((df['arr']
.explode()
.str.get('abbr')
.groupby(level=0)
.agg(list)
.tolist()),
columns=['t1','t2'], index=df.index))
.drop(columns='arr')
.fillna(np.nan))
For the sample that works in your code:
name t1 t2
0 nick ORL ATL
1 tom NYK BOS
For the first sample that doesn't work:
name t1 t2
0 nick ORL NaN
1 tom NYK BOS
For the second sample that doesn't work:
name t1 t2
0 nick NaN ATL
1 tom NYK BOS