Python pandas: unnest a column containing a list of dictionaries

We have the following dataframe:

import pandas as pd
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})

We want to convert the metrics column into 12 new columns, using each metric's name field as the column name and its value field as the cell value. Something like this:

rank               title_name  title_id  artist_id          artist_name    label  Rank                      Song ...   
1    'And It's Still Alright' 120034150     222521 'Nathaniel Rateliff'   'CNCO'     1  "And It's Still Alright"

Here's what the value in the metrics column looks like for the first row:

our_df['metrics'][0]
[{'name': 'Rank', 'value': 1},
 {'name': 'Song', 'value': "And It's Still Alright"},
 {'name': 'Artist', 'value': 'Nathaniel Rateliff'},
 {'name': 'TP Spins', 'value': 933},
 {'name': '+/- Chg. Spins', 'value': -32},
 {'name': 'LP Spins', 'value': 965},
 {'name': 'Stations', 'value': '44/46'},
 {'name': 'Adds', 'value': 0},
 {'name': 'TP Audience', 'value': 1260000},
 {'name': '+/- Chg. Audience', 'value': -40600},
 {'name': 'LP Audience', 'value': 1300600},
 {'name': 'TP Stream', 'value': 413101}]

The +/- in the column names may be problematic, though, as may the . in Chg. Ideally all the column names would be snake_case, with +/- replaced by plus_minus and the . in Chg. simply dropped.

Edit: we can assume that the metric names will be the same in every row in the dataframe. However, there may be other dataframes with different metric names, so it would be preferable if the names 'Rank', 'Song', 'Artist', etc. were not hardcoded. Here is the original list before it was converted into a pandas dataframe:

raw_data = [{'rank': 1,
  'title_name': 'BUTTER',
  'title_id': '',
  'artist_id': '',
  'artist_name': 'BTS',
  'label': '',
  'peak_position': 1,
  'last_week_rank': 7,
  'last_2week_rank': 8,
  'metrics': [{'name': 'Rank', 'value': 1},
   {'name': 'Song', 'value': 'BUTTER'},
   {'name': 'Artist', 'value': 'BTS'},
   {'name': 'Label Description', 'value': None},
   {'name': 'Label', 'value': '    '},
   {'name': 'Last Week Rank', 'value': 7},
   {'name': 'Last 2 Week Rank', 'value': 8},
   {'name': 'Weeks On Chart', 'value': 15}]},
 {'rank': 2,
  'title_name': 'STAY',
  'title_id': '',
  'artist_id': '',
  'artist_name': 'THE KID LAROI & JUS',
  'label': '',
  'peak_position': 1,
  'last_week_rank': 1,
  'last_2week_rank': 1,
  'metrics': [{'name': 'Rank', 'value': 2},
   {'name': 'Song', 'value': 'STAY'},
   {'name': 'Artist', 'value': 'THE KID LAROI & JUS'},
   {'name': 'Label Description', 'value': None},
   {'name': 'Label', 'value': '    '},
   {'name': 'Last Week Rank', 'value': 1},
   {'name': 'Last 2 Week Rank', 'value': 1},
   {'name': 'Weeks On Chart', 'value': 8}]}]

CodePudding user response：

Most likely, the fastest way is to process raw_data as plain Python dictionaries and only then construct a DataFrame from the result.

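import pandas as pd

records = []
for rec in raw_data:
    rec = dict(rec)  # shallow copy so raw_data itself is not modified
    for metric in rec.pop('metrics'):  # remove the nested list from the record
        # process name: '+/-' to 'plus_minus' > drop '.' > snake_case
        # (a bare '/-' is also handled in case the '+' was lost)
        name = (metric['name'].strip().lower()
                .replace('+/-', 'plus_minus').replace('/-', 'plus_minus')
                .replace('.', '').replace(' ', '_'))
        rec[name] = metric['value']  # a metric overwrites a same-named column
    records.append(rec)

df = pd.DataFrame(records)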

Output

Resulting df

	rank	title_name	title_id	artist_id	artist_name	label	peak_position	last_week_rank	last_2week_rank	song	artist	label_description	last_2_week_rank	weeks_on_chart
0	1	BUTTER			BTS		1	7	8	BUTTER	BTS		8	15
1	2	STAY			THE KID LAROI & JUS		1	1	1	STAY	THE KID LAROI & JUS		1	8

The same code works on the question's our_df once it is converted to a list of records:

raw_data = our_df.to_dict(orient='records')

Output

Resulting df from the solution above

	rank	title_name	title_id	artist_id	artist_name	label	song	artist	tp_spins	plus_minus_chg_spins	lp_spins	stations	adds	tp_audience	plus_minus_chg_audience	lp_audience	tp_stream
0	1	And It's Still Alright	120034150	222521	Nathaniel Rateliff	CNCO	And It's Still Alright	Nathaniel Rateliff	933	-32	965	44/46	0	1260000	-40600	1300600	413101
1	2	Black Madonna	106938609	200160	Cage The Elephant	RCA	Black Madonna	Cage The Elephant	814	38	776	38/46	0	1283400	-21600	1305000	362366
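
If the metric names vary between dataframes, the clean-up step can be pulled out into a small helper. The sketch below is only one way to do it: normalize_name is a hypothetical name, and it assumes that any run of characters other than letters and digits should become a single underscore.

```python
import re

def normalize_name(name: str) -> str:
    """Turn a metric label such as '+/- Chg. Spins' into 'plus_minus_chg_spins'."""
    name = name.strip().lower()
    # Spell out the change marker; a bare '/-' is accepted in case the '+' was lost.
    name = re.sub(r"^\+?/-", "plus_minus", name)
    # Drop periods, then collapse every remaining run of non-alphanumerics to '_'.
    name = name.replace(".", "")
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")
```

It could then replace the chained .replace() calls in the loop above, e.g. rec[normalize_name(metric['name'])] = metric['value'].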

CodePudding user response：

Let's start decomposing your issue. After defining our_df we can generate a new dataframe based on the column metrics with:

pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']])

Which outputs:

   Rank                    Song  ... LP Audience  TP Stream
0     1  And It's Still Alright  ...     1300600     413101
0     2           Black Madonna  ...     1305000     362366

Next, it is just a matter of joining the two together with pd.concat() or merge. Assuming the common key is rank (Rank in the metrics), I'll use merge:

our_df.drop(columns=['metrics']).merge(pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']]),left_on='rank',right_on='Rank')

Outputting the full dataframe

    rank              title_name  ... LP Audience TP Stream
0     1  And It's Still Alright  ...     1300600    413101
1     2           Black Madonna  ...     1305000    362366
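
If rank cannot be relied on as a unique key, the rows can be aligned on the dataframe index instead. This is a minimal sketch on a made-up stand-in for our_df (demo_df and its values are hypothetical), assuming each row's metrics list has unique names:

```python
import pandas as pd

# Small stand-in for our_df (hypothetical values, same shape as the question's data).
demo_df = pd.DataFrame({
    "rank": [1, 2],
    "title_name": ["A", "B"],
    "metrics": [
        [{"name": "Rank", "value": 1}, {"name": "TP Spins", "value": 933}],
        [{"name": "Rank", "value": 2}, {"name": "TP Spins", "value": 814}],
    ],
})

# One dict per row, keyed by metric name; reuse the original index so rows line up.
wide = pd.DataFrame(
    [{m["name"]: m["value"] for m in row} for row in demo_df["metrics"]],
    index=demo_df.index,
)

# join() aligns on the index, so no key column is needed.
result = demo_df.drop(columns="metrics").join(wide)
```

Building a single DataFrame from a list of dicts also avoids creating one small DataFrame per row, and a metric missing from a row simply shows up as NaN.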

CodePudding user response：

Alternative that might be robust against missing names

metric_df = our_df.apply(
    lambda r:
        pd.Series(
            index=list(map(lambda d: d['name'], r['metrics'])) + ['rank'],
            data=list(map(lambda d: d['value'], r['metrics'])) + [r['rank']],
        ),
    axis=1,
)

our_df.merge(metric_df, on='rank')

CodePudding user response：

box = pd.concat({index  : pd.DataFrame(ent) 
                 for index, ent in 
                 zip( our_df.index, our_df.metrics)})

( our_df
  .drop(columns = 'metrics')
  .join(box.droplevel(-1))
  .pivot(index=['rank', 'title_name', 'title_id', 'artist_id', 'artist_name', 'label'], 
          columns='name', 
          values='value')
  .reset_index()
)

name  rank              title_name   title_id artist_id         artist_name label +/- Chg. Audience +/- Chg. Spins Adds              Artist LP Audience LP Spins Rank                    Song Stations TP Audience TP Spins TP Stream
0        1  And It's Still Alright  120034150    222521  Nathaniel Rateliff  CNCO            -40600            -32    0  Nathaniel Rateliff     1300600      965    1  And It's Still Alright    44/46     1260000      933    413101
1        2           Black Madonna  106938609    200160   Cage The Elephant   RCA            -21600             38    0   Cage The Elephant     1305000      776    2           Black Madonna    38/46     1283400      814    362366

Working on the raw_data:

from itertools import chain, product

metrics = [ent['metrics']  for ent in raw_data]
non_metrics = [{key : value 
                for key, value 
                in ent.items() 
                if key != 'metrics'} 
                for ent in raw_data]

combo = zip(metrics, non_metrics)
combo = (product(metrics, [non_metrics]) 
         for metrics, non_metrics in combo)
combo = chain.from_iterable(combo)
combo = [{**left, **right} for left, right in combo]

pd.DataFrame(combo)

                 name                value  rank title_name title_id artist_id          artist_name label  peak_position  last_week_rank  last_2week_rank
0                Rank                    1     1     BUTTER                                     BTS                    1               7                8
1                Song               BUTTER     1     BUTTER                                     BTS                    1               7                8
2              Artist                  BTS     1     BUTTER                                     BTS                    1               7                8
3   Label Description                 None     1     BUTTER                                     BTS                    1               7                8
4               Label                          1     BUTTER                                     BTS                    1               7                8
5      Last Week Rank                    7     1     BUTTER                                     BTS                    1               7                8
6    Last 2 Week Rank                    8     1     BUTTER                                     BTS                    1               7                8
7      Weeks On Chart                   15     1     BUTTER                                     BTS                    1               7                8
8                Rank                    2     2       STAY                     THE KID LAROI & JUS                    1               1                1
9                Song                 STAY     2       STAY                     THE KID LAROI & JUS                    1               1                1
10             Artist  THE KID LAROI & JUS     2       STAY                     THE KID LAROI & JUS                    1               1                1
11  Label Description                 None     2       STAY                     THE KID LAROI & JUS                    1               1                1
12              Label                          2       STAY                     THE KID LAROI & JUS                    1               1                1
13     Last Week Rank                    1     2       STAY                     THE KID LAROI & JUS                    1               1                1
14   Last 2 Week Rank                    1     2       STAY                     THE KID LAROI & JUS                    1               1                1
15     Weeks On Chart                    8     2       STAY                     THE KID LAROI & JUS                    1               1                1

You can then reshape/transform into whatever you desire.
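
For instance, one possible reshape back to one row per song is a pivot on the name column. The sketch below uses a small hypothetical long_df in place of the frame above, and assumes every column other than name and value identifies the record (passing a list as index needs pandas 1.1 or later):

```python
import pandas as pd

# Hypothetical long-format frame shaped like the output above.
long_df = pd.DataFrame({
    "name": ["Song", "Weeks On Chart", "Song", "Weeks On Chart"],
    "value": ["BUTTER", 15, "STAY", 8],
    "rank": [1, 1, 2, 2],
    "title_name": ["BUTTER", "BUTTER", "STAY", "STAY"],
})

# Everything except name/value is treated as an identifier column.
id_cols = [c for c in long_df.columns if c not in ("name", "value")]

wide_df = (
    long_df
    .pivot(index=id_cols, columns="name", values="value")
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'name' label on the columns
)
```

Note that pivot raises an error if the same metric name appears twice for one record; pivot_table with an explicit aggfunc would be the usual fallback in that case.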