python pandas unnest data column containing a list of dictionaries

Time:11-19

We have the following dataframe:

import pandas as pd
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})

We want to expand the metrics column into 12 new columns, using each metric's name field as the column name and its value field as the cell value. Something like this:

rank               title_name  title_id  artist_id          artist_name    label  Rank                      Song ...   
1    'And It's Still Alright' 120034150     222521 'Nathaniel Rateliff'   'CNCO'     1  "And It's Still Alright"

Here's what the value in the metrics column looks like for the first row (index 0):

our_df['metrics'][0]
[{'name': 'Rank', 'value': 1},
 {'name': 'Song', 'value': "And It's Still Alright"},
 {'name': 'Artist', 'value': 'Nathaniel Rateliff'},
 {'name': 'TP Spins', 'value': 933},
 {'name': '+/- Chg. Spins', 'value': -32},
 {'name': 'LP Spins', 'value': 965},
 {'name': 'Stations', 'value': '44/46'},
 {'name': 'Adds', 'value': 0},
 {'name': 'TP Audience', 'value': 1260000},
 {'name': '+/- Chg. Audience', 'value': -40600},
 {'name': 'LP Audience', 'value': 1300600},
 {'name': 'TP Stream', 'value': 413101}]

The +/- in the column names may be problematic though, along with the . in Chg. Ideally, all the column names would be snake_case, with +/- replaced by plus_minus and the . in Chg. simply dropped.
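The renaming rules described above could be collected in one small helper. This is only a sketch: the name to_snake_case is hypothetical, and it assumes the sign prefix in the metric names reads '+/-'.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: turn a metric name into a snake_case column name."""
    name = name.replace('+/-', ' plus minus ')   # spell out the sign prefix
    name = name.replace('.', '')                 # drop periods, as in 'Chg.'
    words = re.findall(r'[A-Za-z0-9]+', name)    # keep only alphanumeric runs
    return '_'.join(word.lower() for word in words)

print(to_snake_case('+/- Chg. Spins'))    # plus_minus_chg_spins
print(to_snake_case('Last 2 Week Rank'))  # last_2_week_rank
```

Because nothing here depends on specific metric names, the same helper should work for dataframes whose metrics differ from this one.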

Edit: we can assume that the metric names will be the same in every row in the dataframe. However, there may be other dataframes with different metric names, so it would be preferable if the names 'Rank', 'Song', 'Artist', etc. were not hardcoded. Here is the original list before it was converted into a pandas dataframe:

raw_data = [{'rank': 1,
  'title_name': 'BUTTER',
  'title_id': '',
  'artist_id': '',
  'artist_name': 'BTS',
  'label': '',
  'peak_position': 1,
  'last_week_rank': 7,
  'last_2week_rank': 8,
  'metrics': [{'name': 'Rank', 'value': 1},
   {'name': 'Song', 'value': 'BUTTER'},
   {'name': 'Artist', 'value': 'BTS'},
   {'name': 'Label Description', 'value': None},
   {'name': 'Label', 'value': '    '},
   {'name': 'Last Week Rank', 'value': 7},
   {'name': 'Last 2 Week Rank', 'value': 8},
   {'name': 'Weeks On Chart', 'value': 15}]},
 {'rank': 2,
  'title_name': 'STAY',
  'title_id': '',
  'artist_id': '',
  'artist_name': 'THE KID LAROI & JUS',
  'label': '',
  'peak_position': 1,
  'last_week_rank': 1,
  'last_2week_rank': 1,
  'metrics': [{'name': 'Rank', 'value': 2},
   {'name': 'Song', 'value': 'STAY'},
   {'name': 'Artist', 'value': 'THE KID LAROI & JUS'},
   {'name': 'Label Description', 'value': None},
   {'name': 'Label', 'value': '    '},
   {'name': 'Last Week Rank', 'value': 1},
   {'name': 'Last 2 Week Rank', 'value': 1},
   {'name': 'Weeks On Chart', 'value': 8}]}]

CodePudding user response:

Most likely, the fastest way is to process raw_data while it is still a list of plain dictionaries, and only then construct a DataFrame from it.

import pandas as pd

records = []
for rec in raw_data:
    rec = dict(rec)  # shallow copy, so raw_data itself is left untouched
    for metric in rec.pop('metrics'):  # remove the nested list and iterate over it
        # process name: snake_case > drop '.' > '+/-' to 'plus_minus'
        name = metric['name'].lower().replace(' ', '_').replace('.', '').replace('+/-', 'plus_minus')
        rec[name] = metric['value']  # e.g. 'Rank' overwrites the existing 'rank' key
    records.append(rec)

df = pd.DataFrame(records)

Output

Resulting df

   rank  title_name  title_id  artist_id          artist_name  label  peak_position  last_week_rank  last_2week_rank    song               artist  label_description  last_2_week_rank  weeks_on_chart
0     1      BUTTER                                       BTS                     1               7                8  BUTTER                  BTS               None                 8              15
1     2        STAY                       THE KID LAROI & JUS                     1               1                1    STAY  THE KID LAROI & JUS               None                 1               8

Setup

To use the question's first example as raw_data instead, take our_df exactly as defined in the question and convert it to a list of records:

raw_data = our_df.to_dict(orient='records')
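For readers unfamiliar with orient='records', a tiny illustration of what that call returns may help. The two-column frame below is a made-up stand-in, not the question's data.

```python
import pandas as pd

# Made-up miniature frame, only to show the shape of the result.
df = pd.DataFrame({'rank': [1, 2], 'label': ['CNCO', 'RCA']})

# orient='records' yields one dict per row, keyed by column name.
records = df.to_dict(orient='records')
print(records)
```

Each row becomes its own dictionary, which is the same layout as the raw_data list in the question, so the loop above can consume it directly.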

Output

Resulting df from the solution above

   rank              title_name   title_id  artist_id         artist_name  label                    song              artist  tp_spins  plus_minus_chg_spins  lp_spins  stations  adds  tp_audience  plus_minus_chg_audience  lp_audience  tp_stream
0     1  And It's Still Alright  120034150     222521  Nathaniel Rateliff   CNCO  And It's Still Alright  Nathaniel Rateliff       933                   -32       965     44/46     0      1260000                   -40600      1300600     413101
1     2           Black Madonna  106938609     200160   Cage The Elephant    RCA           Black Madonna   Cage The Elephant       814                    38       776     38/46     0      1283400                   -21600      1305000     362366

CodePudding user response:

Let's start decomposing your issue. After defining our_df we can generate a new dataframe based on the column metrics with:

pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']])

Which outputs:

   Rank                    Song  ... LP Audience  TP Stream
0     1  And It's Still Alright  ...     1300600     413101
0     2           Black Madonna  ...     1305000     362366

Next, it is just a matter of joining the two together with pd.concat() or merge. I assume the common key is the Rank column, so I'll use merge:

our_df.drop(columns=['metrics']).merge(pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']]),left_on='rank',right_on='Rank')

Outputting the full dataframe

    rank              title_name  ... LP Audience TP Stream
0     1  And It's Still Alright  ...     1300600    413101
1     2           Black Madonna  ...     1305000    362366
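If Rank cannot be relied on as a key, the rows can also be lined up by index. The following is a minimal sketch of that variant, not the approach used above: it builds the metric columns from a list of dicts, reuses our_df's index, and joins on it. The two-metric frame is a shortened stand-in for the question's data.

```python
import pandas as pd

# Shortened stand-in for our_df with only two metrics per row.
our_df = pd.DataFrame({
    'rank': [1, 2],
    'title_name': ["And It's Still Alright", 'Black Madonna'],
    'metrics': [
        [{'name': 'Song', 'value': "And It's Still Alright"},
         {'name': 'TP Spins', 'value': 933}],
        [{'name': 'Song', 'value': 'Black Madonna'},
         {'name': 'TP Spins', 'value': 814}],
    ],
})

# One dict per row; reusing our_df's index keeps the rows aligned
# without needing a shared key column.
wide = pd.DataFrame(
    [{m['name']: m['value'] for m in row} for row in our_df['metrics']],
    index=our_df.index,
)

result = our_df.drop(columns='metrics').join(wide)
print(result)
```

Note that join raises an error when a metric name exactly matches an existing column unless a suffix is supplied, so renaming the metric columns first may be needed in that case.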

CodePudding user response:

Alternative that might be robust against missing names

metric_df = our_df.apply(
    lambda r:
        pd.Series(
            index=list(map(lambda d: d['name'], r['metrics'])) + ['rank'],
            data=list(map(lambda d: d['value'], r['metrics'])) + [r['rank']],
        ),
    axis=1,
)

our_df.merge(metric_df, on='rank')
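How missing names behave can be checked on a toy input. The sketch below swaps in the plain DataFrame constructor on a list of dicts rather than the apply call above, and the two-row input is made up for illustration: the second row lacks the 'Adds' metric.

```python
import pandas as pd

# Made-up input: the second row has no 'Adds' metric.
rows = [
    [{'name': 'Rank', 'value': 1}, {'name': 'Adds', 'value': 0}],
    [{'name': 'Rank', 'value': 2}],
]

# Each row becomes a {name: value} dict; pandas takes the union of the keys
# as columns and fills the gaps with NaN.
metric_df = pd.DataFrame([{m['name']: m['value'] for m in ms} for ms in rows])
print(metric_df)
```

A missing metric therefore shows up as NaN in that row instead of raising an error, at the cost of the column being upcast to float.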

CodePudding user response:

box = pd.concat({index  : pd.DataFrame(ent) 
                 for index, ent in 
                 zip( our_df.index, our_df.metrics)})

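( our_df
  .drop(columns = 'metrics')
  .join(box.droplevel(-1))
  # keyword arguments: pivot no longer accepts positional ones in pandas 2.0+
  .pivot(index = ['rank', 'title_name', 'title_id', 'artist_id', 'artist_name', 'label'], 
         columns = 'name', 
         values = 'value')
  .reset_index()
)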

name  rank              title_name   title_id artist_id         artist_name label +/- Chg. Audience +/- Chg. Spins Adds              Artist LP Audience LP Spins Rank                    Song Stations TP Audience TP Spins TP Stream
0        1  And It's Still Alright  120034150    222521  Nathaniel Rateliff  CNCO            -40600            -32    0  Nathaniel Rateliff     1300600      965    1  And It's Still Alright    44/46     1260000      933    413101
1        2           Black Madonna  106938609    200160   Cage The Elephant   RCA            -21600             38    0   Cage The Elephant     1305000      776    2           Black Madonna    38/46     1283400      814    362366

Working on the raw_data:

from itertools import chain, product

metrics = [ent['metrics']  for ent in raw_data]
non_metrics = [{key : value 
                for key, value 
                in ent.items() 
                if key != 'metrics'} 
                for ent in raw_data]

combo = zip(metrics, non_metrics)
combo = (product(metrics, [non_metrics]) 
         for metrics, non_metrics in combo)
combo = chain.from_iterable(combo)
combo = [{**left, **right} for left, right in combo]

pd.DataFrame(combo)

                 name                value  rank title_name title_id artist_id          artist_name label  peak_position  last_week_rank  last_2week_rank
0                Rank                    1     1     BUTTER                                     BTS                    1               7                8
1                Song               BUTTER     1     BUTTER                                     BTS                    1               7                8
2              Artist                  BTS     1     BUTTER                                     BTS                    1               7                8
3   Label Description                 None     1     BUTTER                                     BTS                    1               7                8
4               Label                          1     BUTTER                                     BTS                    1               7                8
5      Last Week Rank                    7     1     BUTTER                                     BTS                    1               7                8
6    Last 2 Week Rank                    8     1     BUTTER                                     BTS                    1               7                8
7      Weeks On Chart                   15     1     BUTTER                                     BTS                    1               7                8
8                Rank                    2     2       STAY                     THE KID LAROI & JUS                    1               1                1
9                Song                 STAY     2       STAY                     THE KID LAROI & JUS                    1               1                1
10             Artist  THE KID LAROI & JUS     2       STAY                     THE KID LAROI & JUS                    1               1                1
11  Label Description                 None     2       STAY                     THE KID LAROI & JUS                    1               1                1
12              Label                          2       STAY                     THE KID LAROI & JUS                    1               1                1
13     Last Week Rank                    1     2       STAY                     THE KID LAROI & JUS                    1               1                1
14   Last 2 Week Rank                    1     2       STAY                     THE KID LAROI & JUS                    1               1                1
15     Weeks On Chart                    8     2       STAY                     THE KID LAROI & JUS                    1               1                1

You can then reshape/transform into whatever you desire.
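One possible reshape is back to one row per song with one column per metric. The sketch below is an assumption-laden illustration: it uses pd.json_normalize as a different route to the same long table than the itertools code above, works on a cut-down raw_data with two metrics per record, and then pivots the result.

```python
import pandas as pd

# Cut-down stand-in for raw_data: two records, two metrics each.
raw_data = [
    {'rank': 1, 'title_name': 'BUTTER',
     'metrics': [{'name': 'Song', 'value': 'BUTTER'},
                 {'name': 'Weeks On Chart', 'value': 15}]},
    {'rank': 2, 'title_name': 'STAY',
     'metrics': [{'name': 'Song', 'value': 'STAY'},
                 {'name': 'Weeks On Chart', 'value': 8}]},
]

# Long format: one row per metric, with the chosen record fields repeated.
long_df = pd.json_normalize(raw_data, record_path='metrics',
                            meta=['rank', 'title_name'])

# Wide format: one row per record, one column per metric name.
wide_df = (
    long_df
    .pivot(index=['rank', 'title_name'], columns='name', values='value')
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'name' label on the columns
print(wide_df)
```

The meta list has to name every non-metric field that should survive the reshape, so it would need extending for the full records.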
