I would like to use the first type
object (always starts with "document_") from this json as a row and all others as a column
[
[
{
"type": "document_type_bank_statement"
},
{
"type": "sender_iban"
},
{
"type": "sender_vat_id"
}
],
[
{
"type": "document_type_bank_statement"
},
{
"type": "sender_iban"
},
{
"type": "sender_vat_id"
}
],
[
{
"type": "document_type_invoice"
},
{
"type": "sender_zip"
}
]
]
Example:
sender_iban | sender_vat_id | sender_zip | |
---|---|---|---|
document_type_bank_statement | 2 | 2 | 0 |
document_type_invoice | 0 | 0 | 1 |
This will give me the first object:
for type in pagewise_data:
print(type[0]['type'])
So all the others:
for type in pagewise_data:
for i in type:
if not i['type'].startswith('document'):
print(i['type'])
This is how i get all types
for type in pagewise_data:
for i in type:
print(i['type'])
My question now is, how do i get this to work like in my given example table with pandas? so that you can also count the existing types
that occur in an array block?
CodePudding user response:
We can start by loading our data using pd.DataFrame.from_dict
.
>>> df = pd.DataFrame.from_dict(data)
>>> df 0 ... 2
0 {'type': 'document_type_bank_statement'} ... {'type': 'sender_vat_id'}
1 {'type': 'document_type_bank_statement'} ... {'type': 'sender_vat_id'}
2 {'type': 'document_type_invoice'} ... None
Then, we can .melt
our dataframe to create pairs of values (first value as future ID and the second value that will be counted in a column at the end). We'll drop NaN values too:
>>> df = df.melt(id_vars=[0]).drop("variable", axis=1)
>>> df = df.dropna()
>>> df
0 value
0 {'type': 'document_type_bank_statement'} {'type': 'sender_iban'}
1 {'type': 'document_type_bank_statement'} {'type': 'sender_iban'}
2 {'type': 'document_type_invoice'} {'type': 'sender_zip'}
3 {'type': 'document_type_bank_statement'} {'type': 'sender_vat_id'}
4 {'type': 'document_type_bank_statement'} {'type': 'sender_vat_id'}
Since each value is a dict
with a "type"
key, we'll use .applymap
to take the corresponding value of that key for each cell in our dataframe:
>>> df = df.applymap(lambda d: d["type"])
>>> df
0 value
0 document_type_bank_statement sender_iban
1 document_type_bank_statement sender_iban
2 document_type_invoice sender_zip
3 document_type_bank_statement sender_vat_id
4 document_type_bank_statement sender_vat_id
Finally, let's use pd.pivot_table
to format our data and count the frequency of each term. Use the fill_value
argument to fill the output with zeros. Also, we can format our index:
>>> df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
>>> df.index = df.index.map(lambda _id: _id[9:])
>>> df
value sender_iban sender_vat_id sender_zip
0
type_bank_statement 2 2 0
type_invoice 0 0 1
Resulting code:
df = pd.DataFrame.from_dict(data)
df = df.melt(id_vars=[0]).drop("variable", axis=1)
df = df.dropna()
df = df.applymap(lambda d: d["type"])
df = pd.pivot_table(df, index=[0], columns=["value"], aggfunc=len, fill_value=0)
df.index = df.index.map(lambda _id: _id[9:])
CodePudding user response:
Following code should work:
from collections import Counter
from itertools import chain
import pandas as pd
def extract_type(value):
if value is not None:
return value["type"]
return value
def count_values(df):
counter = Counter(
list(chain.from_iterable([list(row[1]) for row in df.iterrows()]))
)
del counter[None]
return pd.DataFrame.from_dict(counter.items()).set_index(0).T
original_data = pd.DataFrame.from_records(test_data)
extracted_types = original_data.applymap(extract_type).set_index(0)
counted_values = extracted_types.groupby(0).apply(count_values).fillna(0).droplevel(1)
If test_data is data you provided counted_values should contain:
0 | sender_iban | sender_vat_id | sender_zip |
---|---|---|---|
document_type_bank_statement | 2.0 | 2.0 | 0.0 |
document_type_invoice | 0.0 | 0.0 | 1.0 |