Groupby in Pandas: How to group using a subgroup-CodePudding

My data is in a JSON file. This is how it is organized:

"summary": {
    "file_count": 2
  }, 
  "primary_site": "stomach", 
  "disease_type": "acid reflux", 
  "project": {
    "project_id": "Pro123"
  }, 
  "diagnoses": [
    {
      "primary_diagnosis": "GERD"
    }
  ], 
  "demographic": {
    "ethnicity": "not hispanic or latino", 
    "gender": "female", 
    "race": "Unknown"
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4236 entries, 0 to 4235
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   summary       4236 non-null   object
 1   primary_site  4236 non-null   object
 2   disease_type  4236 non-null   object
 3   project       4236 non-null   object
 4   diagnoses     4236 non-null   object
 5   demographic   4236 non-null   object
dtypes: object(6)

I'm wanting to group by disease type and gender but gender is a subgroup. How would I identify the subgroup in the command?

df2 = df.groupby('disease_type')['gender'].
print(df2)

CodePudding user response：

Passing all your groups into groupby

df.groupby(['disease_type','gender'])

CodePudding user response：

You might be interested in pandas.json_normalize or the answer from How to flatten a nested JSON recursively, with flatten_json

This will allow you to "flatten" your json and convert the resulting list into a dataframe after which you can then use groupby to obtain the data you're looking for.

def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
    """
    Flatten a list of nested dicts.
    https://stackoverflow.com/questions/58442723/how-to-flatten-a-nested-json-recursively-with-flatten-json
    """
    out = dict()
    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}{sep}')
                i  = 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out


df = pd.DataFrame([flatten_json(data)])
df.groupby(['disease_type','demographic_gender'])