Pandas JSON Table

Time:03-08

I would like to use the first type object (it always starts with "document_") from this JSON as a row label and all the others as columns:

[
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_bank_statement"
    },
    {
      "type": "sender_iban"
    },
    {
      "type": "sender_vat_id"
    }
  ],
  [
    {
      "type": "document_type_invoice"
    },
    {
      "type": "sender_zip"
    }
  ]
]

Example:

                              sender_iban  sender_vat_id  sender_zip
document_type_bank_statement            2              2           0
document_type_invoice                   0              0           1

This gives me the first object (renaming the loop variable so it doesn't shadow the built-in type):

for page in pagewise_data:
    print(page[0]['type'])

And this gives me all the others:

for page in pagewise_data:
    for item in page:
        if not item['type'].startswith('document'):
            print(item['type'])

And this is how I get all the types:

for page in pagewise_data:
    for item in page:
        print(item['type'])

My question now is: how do I get this to work with pandas, as in the example table above, so that the types occurring within each array block are counted?

CodePudding user response:

We can start by loading our data using pd.DataFrame.from_dict.

>>> df = pd.DataFrame.from_dict(data)
>>> df
                                          0  ...                          2
0  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
1  {'type': 'document_type_bank_statement'}  ...  {'type': 'sender_vat_id'}
2         {'type': 'document_type_invoice'}  ...                       None

Then, we can .melt our dataframe to create pairs of values (the first value becomes the future row ID, and the second value is what gets counted into a column at the end). We'll also drop the NaN values that pad the shorter rows:

>>> df = df.melt(id_vars=[0]).drop("variable", axis=1)
>>> df = df.dropna()
>>> df
                                          0                      value
0  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
1  {'type': 'document_type_bank_statement'}    {'type': 'sender_iban'}
2         {'type': 'document_type_invoice'}     {'type': 'sender_zip'}
3  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}
4  {'type': 'document_type_bank_statement'}  {'type': 'sender_vat_id'}

Since each value is a dict with a "type" key, we'll use .applymap to take the corresponding value of that key for each cell in our dataframe:

>>> df = df.applymap(lambda d: d["type"])
>>> df
                              0          value
0  document_type_bank_statement    sender_iban
1  document_type_bank_statement    sender_iban
2         document_type_invoice     sender_zip
3  document_type_bank_statement  sender_vat_id
4  document_type_bank_statement  sender_vat_id

Finally, let's use pd.pivot_table to reshape our data and count the frequency of each term. The fill_value argument fills the output with zeros. We can also shorten the index labels by stripping the "document_" prefix:

>>> df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
>>> df.index = df.index.map(lambda _id: _id[9:])
>>> df
value                sender_iban  sender_vat_id  sender_zip
0                                                          
type_bank_statement            2              2           0
type_invoice                   0              0           1

Resulting code:

import pandas as pd

# data is the parsed JSON (list of lists of dicts) from the question
df = pd.DataFrame.from_dict(data)
df = df.melt(id_vars=[0]).drop("variable", axis=1)
df = df.dropna()
df = df.applymap(lambda d: d["type"])
df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
df.index = df.index.map(lambda _id: _id[9:])  # strip the "document_" prefix
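As a design note: once the melt and applymap steps have produced tidy (document type, other type) pairs, the final counting can equally be done in one call with pd.crosstab, which cross-tabulates two columns and fills missing combinations with zeros. A self-contained sketch, with the question's sample JSON bound to a variable named data (an assumption; name it however your code does):

```python
import pandas as pd

# Sample data from the question (assumed variable name)
data = [
    [{"type": "document_type_bank_statement"},
     {"type": "sender_iban"},
     {"type": "sender_vat_id"}],
    [{"type": "document_type_bank_statement"},
     {"type": "sender_iban"},
     {"type": "sender_vat_id"}],
    [{"type": "document_type_invoice"},
     {"type": "sender_zip"}],
]

df = pd.DataFrame.from_dict(data)
# Melt the remaining columns against the first (document) cell; dropna
# removes the None padding that shorter pages produce
df = df.melt(id_vars=[0]).drop("variable", axis=1).dropna()
# Each cell is a one-key dict; keep just the "type" string
df = df.applymap(lambda d: d["type"])
# Cross-tabulate document types against the other types
table = pd.crosstab(df[0], df["value"])
print(table)
```

This keeps the full "document_..." labels in the index, matching the table in the question.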

CodePudding user response:

The following code should work:

from collections import Counter
from itertools import chain
import pandas as pd

def extract_type(value):
    # Cells can be None because from_records pads shorter pages
    if value is not None:
        return value["type"]
    return value


def count_values(df):
    # Flatten every cell value in the group and count each type string
    counter = Counter(
        list(chain.from_iterable([list(row[1]) for row in df.iterrows()]))
    )
    # Discard the None padding (Counter's del is a no-op if absent)
    del counter[None]
    return pd.DataFrame.from_dict(counter.items()).set_index(0).T


original_data = pd.DataFrame.from_records(test_data)
extracted_types = original_data.applymap(extract_type).set_index(0)
counted_values = extracted_types.groupby(0).apply(count_values).fillna(0).droplevel(1)

If test_data is the data you provided, counted_values will contain:

0                             sender_iban  sender_vat_id  sender_zip
document_type_bank_statement          2.0            2.0         0.0
document_type_invoice                 0.0            0.0         1.0
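For completeness, here is a self-contained sketch of this approach with the question's sample list bound to test_data; the only change is building the count frame with pd.DataFrame(counter.items()) directly, which behaves the same as the from_dict call here:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data from the question
test_data = [
    [{"type": "document_type_bank_statement"},
     {"type": "sender_iban"},
     {"type": "sender_vat_id"}],
    [{"type": "document_type_bank_statement"},
     {"type": "sender_iban"},
     {"type": "sender_vat_id"}],
    [{"type": "document_type_invoice"},
     {"type": "sender_zip"}],
]

def extract_type(value):
    # Shorter pages are padded with None by from_records; pass that through
    return value["type"] if value is not None else None

def count_values(df):
    # Count every type string in the group's rows
    counter = Counter(chain.from_iterable(list(row) for _, row in df.iterrows()))
    # Drop the None padding (Counter's del is a no-op if None is absent)
    del counter[None]
    # Two-column (name, count) frame -> one row of counts per group
    return pd.DataFrame(counter.items()).set_index(0).T

original_data = pd.DataFrame.from_records(test_data)
extracted_types = original_data.applymap(extract_type).set_index(0)
counted_values = extracted_types.groupby(0).apply(count_values).fillna(0).droplevel(1)
print(counted_values)
```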