Home > Back-end >  Extracting values from dictionary list in pandas dataframe
Extracting values from dictionary list in pandas dataframe

Time:12-23

I have the following pandas dataframe:

pd.DataFrame({'keys': {3: 'brandId', 5: 'price', 14: 'sizes', 18: 'brandId', 20: 'price', 29: 'sizes', 30: 'condition', 31: 'condition', 32: 'colour', 33: 'age', 36: 'brand', 40: 'colour', 41: 'brand', 44: 'productType', 50: 'brandId', 52: 'price', 61: 'sizes', 62: 'condition', 63: 'colour', 64: 'age', 67: 'brand', 70: 'productType'}, 'values': {3: 925, 5: {'currencyName': 'GBP', 'priceAmount': '50.00', 'nationalShippingCost': '3.00'}, 14: {'id': 4, 'name': 'UK 4', 'quantity': 1}, 18: 925, 20: {'currencyName': 'GBP', 'priceAmount': '11.00', 'nationalShippingCost': '0.00'}, 29: {'id': 3, 'name': 'S', 'quantity': 1}, 30: {'id': 'used_like_new', 'name': 'Like new'}, 31: {'id': 'brand_new', 'name': 'Brand new'}, 32: {'id': 'multi', 'name': 'Multi'}, 33: {'id': 'modern', 'name': 'Modern'}, 36: 'chinese-laundry', 40: {'id': 'white', 'name': 'White'}, 41: 'chinese-laundry', 44: 'tshirts', 50: 925, 52: {'currencyName': 'GBP', 'priceAmount': '20.00', 'nationalShippingCost': '3.00'}, 61: {'id': 11, 'name': 'M', 'quantity': 1}, 62: {'id': 'brand_new', 'name': 'Brand new'}, 63: {'id': 'black', 'name': 'Black'}, 64: {'id': '90s', 'name': '90s'}, 67: 'chinese-laundry', 70: 'jackets'}})

Which looks like this:

    keys    values
3   brandId 925
5   price   {'currencyName': 'GBP', 'priceAmount': '50.00'...
14  sizes   {'id': 4, 'name': 'UK 4', 'quantity': 1}
18  brandId 925
20  price   {'currencyName': 'GBP', 'priceAmount': '11.00'...
29  sizes   {'id': 3, 'name': 'S', 'quantity': 1}
30  condition   {'id': 'used_like_new', 'name': 'Like new'}
...

I want to flatten out the dictionary for specific values belonging to their key. For example, grab only the value from priceAmount, and the value from name in any other dictionary key.

So expected output:

    keys           values
3   brandId        925
5   price          50.00
14  sizes          UK 4
18  brandId        925
20  price          11.00
29  sizes          S
30  condition      Like new}

I can do this with the following, which takes an awfully long time if I had more to replace!


price_data = []
for price in data[data['keys'].str.contains('price', na=False)].values:
    price_data.append(price[1]['priceAmount'])
    
condition_data = []
for condition in data[data['keys'].str.contains('condition', na=False)].values:
    condition_data.append(condition[1]['name'])
    
age_data = []
for age in data[data['keys'].str.contains('age', na=False)].values:
    age_data.append(age[1]['name'])
    
sizes_data = []
for sizes in data[data['keys'].str.contains('sizes', na=False)].values:
    sizes_data.append(sizes[1]['name'])

colour_data = []
for colour in data[data['keys'].str.contains('colour', na=False)].values:
    colour_data.append(colour[1]['name'])

#replace the values
data=data.replace(data[data['keys'].str.contains('price', na=False)]['values'].values, price_data) 
data=data.replace(data[data['keys'].str.contains('condition', na=False)]['values'].values, condition_data) 
data=data.replace(data[data['keys'].str.contains('age', na=False)]['values'].values, age_data) 
data=data.replace(data[data['keys'].str.contains('sizes', na=False)]['values'].values, sizes_data) 
data=data.replace(data[data['keys'].str.contains('colour', na=False)]['values'].values, colour_data) 

Is there a faster and smoother alternative to this?

CodePudding user response:

Perhaps, if you have access to the dict that you used to make the df, you can instead use json_normalize().

For example:

d = {
    'keys': {
        3: 'brandId', 5: 'price', 14: 'sizes', 18: 'brandId', 20: 'price', 29: 'sizes', 30: 'condition',
        31: 'condition', 32: 'colour', 33: 'age', 36: 'brand', 40: 'colour', 41: 'brand', 44: 'productType',
        50: 'brandId', 52: 'price', 61: 'sizes', 62: 'condition', 63: 'colour', 64: 'age', 67: 'brand',
        70: 'productType',
    },
    'values': {
        3: 925, 5: {'currencyName': 'GBP', 'priceAmount': '50.00', 'nationalShippingCost': '3.00'},
        14: {'id': 4, 'name': 'UK 4', 'quantity': 1}, 18: 925,
        20: {'currencyName': 'GBP', 'priceAmount': '11.00', 'nationalShippingCost': '0.00'},
        29: {'id': 3, 'name': 'S', 'quantity': 1}, 30: {'id': 'used_like_new', 'name': 'Like new'},
        31: {'id': 'brand_new', 'name': 'Brand new'}, 32: {'id': 'multi', 'name': 'Multi'}, 33:
        {'id': 'modern', 'name': 'Modern'}, 36: 'chinese-laundry', 40: {'id': 'white', 'name': 'White'},
        41: 'chinese-laundry', 44: 'tshirts', 50: 925,
        52: {'currencyName': 'GBP', 'priceAmount': '20.00', 'nationalShippingCost': '3.00'},
        61: {'id': 11, 'name': 'M', 'quantity': 1}, 62: {'id': 'brand_new', 'name': 'Brand new'},
        63: {'id': 'black', 'name': 'Black'}, 64: {'id': '90s', 'name': '90s'}, 67: 'chinese-laundry',
        70: 'jackets',
    },
}

Note that this dict is in a bit of an unusual form, with keys and values separated at the top level. In order to put them together, so that json_normalize() can be used, we want to merge them so that all records are complete (keys and values). That factors out the numeric keys for each. Note, since I assume there are many records in values (probably a list of dicts?), you would have to do that on each of them.

>>> mod_d = {d['keys'][i]: v for i, v in d['values'].items()}
>>> mod_d
{'brandId': 925,
 'price': {'currencyName': 'GBP',
  'priceAmount': '20.00',
  'nationalShippingCost': '3.00'},
 'sizes': {'id': 11, 'name': 'M', 'quantity': 1},
 'condition': {'id': 'brand_new', 'name': 'Brand new'},
 'colour': {'id': 'black', 'name': 'Black'},
 'age': {'id': '90s', 'name': '90s'},
 'brand': 'chinese-laundry',
 'productType': 'jackets'}

With this, we can now use json_normalize():

>>> df = pd.json_normalize(mod_d)
>>> df
   brandId            brand productType price.currencyName price.priceAmount  \
0      925  chinese-laundry     jackets                GBP             20.00   

  price.nationalShippingCost  sizes.id sizes.name  sizes.quantity  \
0                       3.00        11          M               1   

  condition.id condition.name colour.id colour.name age.id age.name  
0    brand_new      Brand new     black       Black    90s      90s  
, 

CodePudding user response:

pandas string methods allow accessing values from lists/tuples/dictionaries:

df['val'] = np.where(df['keys'] == 'price', 
                     df['values'].str['priceAmount'], 
                     df['values'].str['name'])

df['val'] = df['val'].fillna(df['values'])

 keys                                             values              val
3       brandId                                                925              925
5         price  {'currencyName': 'GBP', 'priceAmount': '50.00'...            50.00
14        sizes           {'id': 4, 'name': 'UK 4', 'quantity': 1}             UK 4
18      brandId                                                925              925
20        price  {'currencyName': 'GBP', 'priceAmount': '11.00'...            11.00
29        sizes              {'id': 3, 'name': 'S', 'quantity': 1}                S
30    condition        {'id': 'used_like_new', 'name': 'Like new'}         Like new
31    condition           {'id': 'brand_new', 'name': 'Brand new'}        Brand new
32       colour                   {'id': 'multi', 'name': 'Multi'}            Multi
33          age                 {'id': 'modern', 'name': 'Modern'}           Modern
36        brand                                    chinese-laundry  chinese-laundry
40       colour                   {'id': 'white', 'name': 'White'}            White
41        brand                                    chinese-laundry  chinese-laundry
44  productType                                            tshirts          tshirts
50      brandId                                                925              925
52        price  {'currencyName': 'GBP', 'priceAmount': '20.00'...            20.00
61        sizes             {'id': 11, 'name': 'M', 'quantity': 1}                M
62    condition           {'id': 'brand_new', 'name': 'Brand new'}        Brand new
63       colour                   {'id': 'black', 'name': 'Black'}            Black
64          age                       {'id': '90s', 'name': '90s'}              90s
67        brand                                    chinese-laundry  chinese-laundry
70  productType                                            jackets          jackets


CodePudding user response:

Another option is to use a simple list comprehension:

df['values'] = [i.get('priceAmount') or i.get('name') if isinstance(i, dict) else i for i in df['values'].tolist()]

Output:

           keys           values
3       brandId              925
5         price            50.00
14        sizes             UK 4
18      brandId              925
20        price            11.00
29        sizes                S
30    condition         Like new
31    condition        Brand new
32       colour            Multi
33          age           Modern
36        brand  chinese-laundry
40       colour            White
41        brand  chinese-laundry
44  productType          tshirts
50      brandId              925
52        price            20.00
61        sizes                M
62    condition        Brand new
63       colour            Black
64          age              90s
67        brand  chinese-laundry
70  productType          jackets
  • Related