Pandas JSON Table-CodePudding

I would like to use the first type object (always starts with "document_") from this json as a row and all others as a column

[
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_invoice"
    },
    {
      "type": "sender_zip"
    }
  ]
]

Example:

	sender_iban	sender_vat_id	sender_zip
document_type_bank_statement	2	2	0
document_type_invoice	0	0	1

This will give me the first object:

for type in pagewise_data:
    print(type[0]['type'])

So all the others:

for type in pagewise_data:
    for i in type:
        if not i['type'].startswith('document'):
            print(i['type'])

This is how i get all types

for type in pagewise_data:
    for i in type:
        print(i['type'])

My question now is, how do i get this to work like in my given example table with pandas? so that you can also count the existing types that occur in an array block?

CodePudding user response：

We can start by loading our data using pd.DataFrame.from_dict.

>>> df = pd.DataFrame.from_dict(data)
>>> df                                            0  ...                          2
0  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
1  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
2         {'type': 'document_type_invoice'}  ...                       None

Then, we can .melt our dataframe to create pairs of values (first value as future ID and the second value that will be counted in a column at the end). We'll drop NaN values too:

>>> df = df.melt(id_vars=[0]).drop("variable", axis=1)
>>> df = df.dropna()
>>> df
                                          0                      value
0  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
1  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
2         {'type': 'document_type_invoice'}     {'type': 'sender_zip'}
3  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}
4  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}

Since each value is a dict with a "type" key, we'll use .applymap to take the corresponding value of that key for each cell in our dataframe:

>>> df = df.applymap(lambda d: d["type"])
>>> df
                              0          value
0  document_type_bank_statement    sender_iban
1  document_type_bank_statement    sender_iban
2         document_type_invoice     sender_zip
3  document_type_bank_statement  sender_vat_id
4  document_type_bank_statement  sender_vat_id

Finally, let's use pd.pivot_table to format our data and count the frequency of each term. Use the fill_value argument to fill the output with zeros. Also, we can format our index:

>>> df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
>>> df.index = df.index.map(lambda _id: _id[9:])
>>> df
value                sender_iban  sender_vat_id  sender_zip
0                                                          
type_bank_statement            2              2           0
type_invoice                   0              0           1

Resulting code:

df = pd.DataFrame.from_dict(data)
df = df.melt(id_vars=[0]).drop("variable", axis=1)
df = df.dropna()
df = df.applymap(lambda d: d["type"])
df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
df.index = df.index.map(lambda _id: _id[9:])

CodePudding user response：

Following code should work:

from collections import Counter
from itertools import chain
import pandas as pd

def extract_type(value):
    if value is not None:
        return value["type"]
    return value


def count_values(df):
    counter = Counter(
        list(chain.from_iterable([list(row[1]) for row in df.iterrows()]))
    )
    del counter[None]
    return pd.DataFrame.from_dict(counter.items()).set_index(0).T


original_data = pd.DataFrame.from_records(test_data)
extracted_types = original_data.applymap(extract_type).set_index(0)
counted_values = extracted_types.groupby(0).apply(count_values).fillna(0).droplevel(1)

If test_data is data you provided counted_values should contain:

0	sender_iban	sender_vat_id	sender_zip
document_type_bank_statement	2.0	2.0	0.0
document_type_invoice	0.0	0.0	1.0