Home > front end >  Convert Dictionary of Dictionaries with list values to a data frame
Convert Dictionary of Dictionaries with list values to a data frame

Time:01-13

I have been provided a very large dictionary with the following format that I am unsure how to convert to a dataframe that I can use to perform basic functions on.

{
    'hash': {
        'ids': [List of Unique IDs of records this hash has been seen in],
        'weights': [List of weights],
        'values': [List of values],
        'measure_dates': [List of dates]
    }
}

The number of items in ids, weights, values and measure_dates is the same within a hash. Different hashes can have a different number of items though. It depends on how often a measurement is taken.

Real(ish) data for an example of three records:

{
    'IRR-99876-UTY': {
        'ids': [9912234, 9912237, 45555889],
        'weights': [0.09, 0.09, 0.113],
        'values': [2.31220, 2.31219, 2.73944],
        'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
    },
    'IRR-10881-CKZ': {
        'ids': [45557231],
        'weights': [0.31],
        'values': [5.221001],
        'measure_dates': ['2022-12-31']
    },
    'IRR-881-CKZ': {
        'ids': [24661, 24662, 29431],
        'weights': [0.05, 0.07, 0.105],
        'values': [3.254, 4.500001, 7.3221],
        'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
    }
}

The value in an index corresponds to the same measurement being taken. For example in IRR-881-CKZ, there are 3 measurements.

  • Measurement 1 taken on 2018-05-05, with id 24661, weight 0.05, and value 3.254
  • Measurement 2 taken on 2018-05-06, with id 24662, weight 0.07 and value 4.500001
  • Measurement 3 taken on 2018-07-01, with id 29431, weight 0.105 and value 7.3221

No other combination of indexes is valid for this hash.

Information that I'm going to be attempting to get data on:

  • Which hash(es) are measured the most often. This can be determined by which have the largest number of items in the ids list. In this example, the first and third record have three items so would be the top results. I'd love to be able to use something like nlargest() or sort_values().head() to get this, instead of parsing each record and counting the number of items.
  • Which hashes have an average value between two values. If I had a set number of columns, I think I'd be able to do something like df['average'] = df[['value1', 'value2']].mean(axis=1), but with a variable number of values I'm not sure how to do this.

How can I convert this dictionary of dictionaries of lists to a usable dataframe?

CodePudding user response:

You can use .from_dict() in pandas to convert it to a dataframe.

import pandas as pd

# dictionary of dictionaries with list values
data = {
    'IRR-99876-UTY': {
        'ids': [9912234, 9912237, 45555889],
        'weights': [0.09, 0.09, 0.113],
        'values': [2.31220, 2.31219, 2.73944],
        'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
    },
    'IRR-10881-CKZ': {
        'ids': [45557231],
        'weights': [0.31],
        'values': [5.221001],
        'measure_dates': ['2022-12-31']
    },
    'IRR-881-CKZ': {
        'ids': [24661, 24662, 29431],
        'weights': [0.05, 0.07, 0.105],
        'values': [3.254, 4.500001, 7.3221],
        'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
    }
}

# convert to data frame
df = pd.DataFrame.from_dict(data, orient='index')

CodePudding user response:

You'll need to convert each entry of this dictionary into its own DataFrame and concatenate those to effectively work with this data:

Creating a Usable DataFrame

import pandas as pd

data = {
    'IRR-99876-UTY': {
        'ids': [9912234, 9912237, 45555889],
        'weights': [0.09, 0.09, 0.113],
        'values': [2.31220, 2.31219, 2.73944],
        'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
    },
    'IRR-10881-CKZ': {
        'ids': [45557231],
        'weights': [0.31],
        'values': [5.221001],
        'measure_dates': ['2022-12-31']
    },
    'IRR-881-CKZ': {
        'ids': [24661, 24662, 29431],
        'weights': [0.05, 0.07, 0.105],
        'values': [3.254, 4.500001, 7.3221],
        'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
    }
}

df = pd.concat(
    {k: pd.DataFrame(v) for k, v in data.items()}, 
    names=['hash', 'obs']
)

print(df)
                        ids  weights    values measure_dates
hash          obs                                           
IRR-99876-UTY 0     9912234    0.090  2.312200    2021-10-14
              1     9912237    0.090  2.312190    2021-10-15
              2    45555889    0.113  2.739440    2022-12-17
IRR-10881-CKZ 0    45557231    0.310  5.221001    2022-12-31
IRR-881-CKZ   0       24661    0.050  3.254000    2018-05-05
              1       24662    0.070  4.500001    2018-05-06
              2       29431    0.105  7.322100    2018-07-01

Now that our data is cleaned up we can solve your questions.

Solving Your Questions

  1. Which hash(es) are measured the most often
  1. Which hashes have an average value between two values.
  • This is a groupby operation where we calculate the average from the "values" column per unique "hash". From there we can use the Series.between method to check whether a those averages exist between two arbitrary values.
# Which hash(es) are measured the most often.
df.index.get_level_values('hash').value_counts()

# IRR-99876-UTY    3
# IRR-881-CKZ      3
# IRR-10881-CKZ    1
# Name: hash, dtype: int64

# ---
# Which hashes have an average value between two values.
## Here you can see that I'm testing whether the average is between 0 and 4
print(df.groupby('hash')['values'].mean().between(0, 4))

# IRR-10881-CKZ    False
# IRR-881-CKZ      False
# IRR-99876-UTY     True
# Name: values, dtype: bool
  • Related