I have been provided a very large dictionary with the following format that I am unsure how to convert to a dataframe that I can use to perform basic functions on.
{
'hash': {
'ids': [List of Unique IDs of records this hash has been seen in],
'weights': [List of weights],
'values': [List of values],
'measure_dates': [List of dates]
}
}
The number of items in ids, weights, values, and measure_dates is the same within a hash. Different hashes can have a different number of items, though; it depends on how often a measurement is taken.
Real(ish) data for an example of three records:
{
'IRR-99876-UTY': {
'ids': [9912234, 9912237, 45555889],
'weights': [0.09, 0.09, 0.113],
'values': [2.31220, 2.31219, 2.73944],
'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
},
'IRR-10881-CKZ': {
'ids': [45557231],
'weights': [0.31],
'values': [5.221001],
'measure_dates': ['2022-12-31']
},
'IRR-881-CKZ': {
'ids': [24661, 24662, 29431],
'weights': [0.05, 0.07, 0.105],
'values': [3.254, 4.500001, 7.3221],
'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
}
}
Values at the same index across the lists belong to the same measurement. For example, IRR-881-CKZ has 3 measurements:
- Measurement 1 taken on 2018-05-05, with id 24661, weight 0.05, and value 3.254
- Measurement 2 taken on 2018-05-06, with id 24662, weight 0.07, and value 4.500001
- Measurement 3 taken on 2018-07-01, with id 29431, weight 0.105, and value 7.3221
No other combination of indexes is valid for this hash.
Information that I'm going to be attempting to get data on:
- Which hash(es) are measured the most often. This can be determined by which have the largest number of items in the ids list. In this example, the first and third record have three items, so they would be the top results. I'd love to be able to use something like nlargest() or sort_values().head() to get this, instead of parsing each record and counting the number of items.
- Which hashes have an average value between two values. If I had a set number of columns, I think I'd be able to do something like df['average'] = df[['value1', 'value2']].mean(axis=1), but with a variable number of values I'm not sure how to do this.
How can I convert this dictionary of dictionaries of lists to a usable dataframe?
CodePudding user response:
You can use pd.DataFrame.from_dict() with orient='index' to convert it to a dataframe. Note that each cell will then hold a whole list.
import pandas as pd
# dictionary of dictionaries with list values
data = {
'IRR-99876-UTY': {
'ids': [9912234, 9912237, 45555889],
'weights': [0.09, 0.09, 0.113],
'values': [2.31220, 2.31219, 2.73944],
'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
},
'IRR-10881-CKZ': {
'ids': [45557231],
'weights': [0.31],
'values': [5.221001],
'measure_dates': ['2022-12-31']
},
'IRR-881-CKZ': {
'ids': [24661, 24662, 29431],
'weights': [0.05, 0.07, 0.105],
'values': [3.254, 4.500001, 7.3221],
'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
}
}
# convert to data frame
df = pd.DataFrame.from_dict(data, orient='index')
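With this layout each cell contains a Python list, so both of your questions can still be answered with list-aware operations rather than a fixed set of columns. A minimal sketch (the 2-to-4 range is just an example threshold):

```python
import pandas as pd

data = {
    'IRR-99876-UTY': {'ids': [9912234, 9912237, 45555889],
                      'weights': [0.09, 0.09, 0.113],
                      'values': [2.31220, 2.31219, 2.73944],
                      'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']},
    'IRR-10881-CKZ': {'ids': [45557231],
                      'weights': [0.31],
                      'values': [5.221001],
                      'measure_dates': ['2022-12-31']},
    'IRR-881-CKZ': {'ids': [24661, 24662, 29431],
                    'weights': [0.05, 0.07, 0.105],
                    'values': [3.254, 4.500001, 7.3221],
                    'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']},
}

df = pd.DataFrame.from_dict(data, orient='index')

# Count measurements per hash by taking len() of each list cell
# (df['ids'].str.len() also works on list-valued cells).
counts = df['ids'].map(len)
print(counts.nlargest(2))   # the two most-measured hashes, ties kept in order

# Average value per hash: a plain Python mean over each list,
# then Series.between to test the example 2-to-4 range.
avg = df['values'].map(lambda v: sum(v) / len(v))
print(avg.between(2, 4))
```

This keeps the dictionary shape intact, but the long-format approach below (one row per measurement) is generally easier to extend.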
CodePudding user response:
You'll need to convert each entry of this dictionary into its own DataFrame and concatenate those to effectively work with this data:
Creating a Usable DataFrame
import pandas as pd
data = {
'IRR-99876-UTY': {
'ids': [9912234, 9912237, 45555889],
'weights': [0.09, 0.09, 0.113],
'values': [2.31220, 2.31219, 2.73944],
'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
},
'IRR-10881-CKZ': {
'ids': [45557231],
'weights': [0.31],
'values': [5.221001],
'measure_dates': ['2022-12-31']
},
'IRR-881-CKZ': {
'ids': [24661, 24662, 29431],
'weights': [0.05, 0.07, 0.105],
'values': [3.254, 4.500001, 7.3221],
'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
}
}
df = pd.concat(
{k: pd.DataFrame(v) for k, v in data.items()},
names=['hash', 'obs']
)
print(df)
ids weights values measure_dates
hash obs
IRR-99876-UTY 0 9912234 0.090 2.312200 2021-10-14
1 9912237 0.090 2.312190 2021-10-15
2 45555889 0.113 2.739440 2022-12-17
IRR-10881-CKZ 0 45557231 0.310 5.221001 2022-12-31
IRR-881-CKZ 0 24661 0.050 3.254000 2018-05-05
1 24662 0.070 4.500001 2018-05-06
2 29431 0.105 7.322100 2018-07-01
Now that our data is cleaned up we can solve your questions.
Solving Your Questions
- Which hash(es) are measured the most often
- This is simply a Series.value_counts operation. However, since the data we're interested in is currently in the index, we'll need to grab it out using Index.get_level_values first.
- Which hashes have an average value between two values.
- This is a groupby operation where we calculate the average of the "values" column per unique "hash". From there we can use the Series.between method to check whether those averages fall between two arbitrary values.
# Which hash(es) are measured the most often.
df.index.get_level_values('hash').value_counts()
# IRR-99876-UTY 3
# IRR-881-CKZ 3
# IRR-10881-CKZ 1
# Name: hash, dtype: int64
# ---
# Which hashes have an average value between two values.
## Here you can see that I'm testing whether the average is between 0 and 4
print(df.groupby('hash')['values'].mean().between(0, 4))
# IRR-10881-CKZ False
# IRR-881-CKZ False
# IRR-99876-UTY True
# Name: values, dtype: bool
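If you need the hash names themselves rather than the boolean mask, you can use the mask to index the averages. A short follow-on sketch reusing the same example data (the 0-to-4 range is again arbitrary):

```python
import pandas as pd

data = {
    'IRR-99876-UTY': {'ids': [9912234, 9912237, 45555889],
                      'weights': [0.09, 0.09, 0.113],
                      'values': [2.31220, 2.31219, 2.73944],
                      'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']},
    'IRR-10881-CKZ': {'ids': [45557231],
                      'weights': [0.31],
                      'values': [5.221001],
                      'measure_dates': ['2022-12-31']},
    'IRR-881-CKZ': {'ids': [24661, 24662, 29431],
                    'weights': [0.05, 0.07, 0.105],
                    'values': [3.254, 4.500001, 7.3221],
                    'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']},
}

# One row per measurement, hash and observation number in the index.
df = pd.concat({k: pd.DataFrame(v) for k, v in data.items()},
               names=['hash', 'obs'])

# Average per hash, then keep only rows whose average is in range.
avg = df.groupby('hash')['values'].mean()
in_range = avg[avg.between(0, 4)]
print(list(in_range.index))   # -> ['IRR-99876-UTY']
```

Because the mask indexes a Series keyed by hash, in_range carries both the qualifying hash names (its index) and their averages (its values).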