My data is in a JSON file. This is how it is organized:
"summary": {
"file_count": 2
},
"primary_site": "stomach",
"disease_type": "acid reflux",
"project": {
"project_id": "Pro123"
},
"diagnoses": [
{
"primary_diagnosis": "GERD"
}
],
"demographic": {
"ethnicity": "not hispanic or latino",
"gender": "female",
"race": "Unknown"
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4236 entries, 0 to 4235
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 summary 4236 non-null object
1 primary_site 4236 non-null object
2 disease_type 4236 non-null object
3 project 4236 non-null object
4 diagnoses 4236 non-null object
5 demographic 4236 non-null object
dtypes: object(6)
I'm wanting to group by disease type and gender but gender is a subgroup. How would I identify the subgroup in the command?
df2 = df.groupby('disease_type')['gender'].
print(df2)
CodePudding user response:
Passing all your groups into groupby
df.groupby(['disease_type','gender'])
CodePudding user response:
You might be interested in pandas.json_normalize or the answer from How to flatten a nested JSON recursively, with flatten_json
This will allow you to "flatten" your json and convert the resulting list into a dataframe after which you can then use groupby to obtain the data you're looking for.
def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
"""
Flatten a list of nested dicts.
https://stackoverflow.com/questions/58442723/how-to-flatten-a-nested-json-recursively-with-flatten-json
"""
out = dict()
def flatten(x: (list, dict, str), name: str='', exclude=exclude):
if type(x) is dict:
for a in x:
if a not in exclude:
flatten(x[a], f'{name}{a}{sep}')
elif type(x) is list:
i = 0
for a in x:
flatten(a, f'{name}{i}{sep}')
i = 1
else:
out[name[:-1]] = x
flatten(nested_json)
return out
df = pd.DataFrame([flatten_json(data)])
df.groupby(['disease_type','demographic_gender'])