How to create a nested json with inner logic, using a csv file as source and pandas?


Explanation: A CSV file needs to be converted to a JSON file. The CSV holds the following information: each account must pay 1 to n documents, one due on the current day and the others with earlier due dates. The CSV file has the following structure:

account | customer_name | due_date | id_type | id_document | to_pay |
2001     | John Doe      | 05082022 | FC      | 987654123   | 113,25 |
2002     | Karl Steven   | 05082022 | FC      | 987654124   | 113,25 |
2003     | Sarah Miller  | 15022022 | FC      | 987654125   | 199,75 |
2004     | Marty Mcfly   | 15032022 | FC      | 987654126   | 113,25 |
2001     | John Doe      | 06032022 | FC      | 987654999   | 150,00 |
2001     | John Doe      | 15042022 | FC      | 987654777   | 15,00  |
2003     | Sarah Miller  | 05082022 | FC      | 987444123   | 113,25 |
2001     | John Doe      | 15032022 | FC      | 987654123   | 113,25 |

Desired output: The output must be a multi-line JSON file (one JSON object per line) that represents the logic mentioned above. Taking just one line of the JSON, pretty-printed, it must look like this:

{
    "account": "2001",
    "customer_name": "John Doe",
    "due_date" : "05-08-2022",
    "id_type": "FC",
    "id_document": "987654123",
    "to_pay": "113,25",
    "pending_documents":[
        {
            "id_type" : "FC",
            "id_document" : "987654999",
            "to_pay" : "150,00",
            "due_date": "06-03-2022"
        },
        {
            "id_type" : "FC",
            "id_document" : "987654777",
            "to_pay" : "15,00",
            "due_date": "15-04-2022"
        },
        {
            "id_type" : "FC",
            "id_document" : "987654123",
            "to_pay" : "113,25",
            "due_date": "15-03-2022"
        }
    ]
}

What I have tried: Using just Python (this piece of code is not functional as shown, but it illustrates the approach concretely):

import codecs
import json
import re
from datetime import datetime

# `pattern` matches the lines that hold a current-day document; its
# definition (and `input_file`) are omitted here, as in the original.
output = []
cuentas_cto = []

with codecs.open(input_file, 'r', encoding='utf-8', errors='ignore') as in_file:
    for line in in_file:
        if pattern.search(line):
            line_object = {'pending_documents': []}
            line_object['account'] = line[2:10]
            cuentas_cto.append(line_object['account'])
            line_object['due_date'] = datetime.strptime(line[90:98], '%d%m%Y').strftime('%Y-%m-%d')
            line_object['id_document'] = line[105:113]
            # rescan the whole file for the other documents of this account
            pattern_cuenta = re.compile(line_object['account'])
            with codecs.open(input_file, 'r', encoding='utf-8', errors='ignore') as file_reopen:
                for inner_line in file_reopen:
                    if pattern_cuenta.search(inner_line) and not pattern.search(inner_line):
                        inner_object = {
                            'due_date': datetime.strptime(inner_line[90:98], '%d%m%Y').strftime('%Y-%m-%d'),
                            'id_document': inner_line[105:113],
                        }
                        line_object['pending_documents'].append(inner_object)
            output.append(json.dumps(line_object))

This approach is far from elegant, and while it works, it takes a massive amount of time to complete. That makes it unusable here: each CSV file is around 150 MB with roughly 1,000,000 lines, and the snippet reopens and rescans the whole file once per matching line, so the runtime grows quadratically with the number of lines. A single pass that groups rows in a dict keyed by account avoids that rescan; see the sketch below.
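As a baseline before reaching for pandas, here is a minimal single-pass sketch in plain Python. It assumes the pipe-separated layout and column names shown above (the real source may be fixed-width, as the slicing in the snippet suggests), that the current-day document is the first row seen for each account, and illustrative file names:

import csv
import json

# account -> record with nested pending_documents, built in one pass
accounts = {}

with open('inCSV.csv', encoding='utf-8', newline='') as in_file:
    reader = csv.DictReader(in_file, delimiter='|')
    for raw_row in reader:
        # strip the padding spaces around headers and values,
        # and drop the empty column created by the trailing '|'
        row = {k.strip(): v.strip() for k, v in raw_row.items() if k and k.strip()}
        doc = {
            'id_type': row['id_type'],
            'id_document': row['id_document'],
            'to_pay': row['to_pay'],
            'due_date': row['due_date'],
        }
        if row['account'] not in accounts:
            # the first document seen for an account becomes the main record
            accounts[row['account']] = {
                'account': row['account'],
                'customer_name': row['customer_name'],
                **doc,
                'pending_documents': [],
            }
        else:
            accounts[row['account']]['pending_documents'].append(doc)

with open('data-out.json', 'w', encoding='utf-8') as out_file:
    for record in accounts.values():
        out_file.write(json.dumps(record) + '\n')  # one JSON object per line

Because it reads the file exactly once and does dict lookups instead of rescans, this is linear in the number of lines.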

Why pandas: It is really fast at loading and processing data, so it looks promising. Here is a really small POC:

>>> import pandas as pd
>>> csv_file = pd.read_csv("inCSV.csv", sep="|", header=0, dtype=str)
>>> csv_file['pending_documents'] = csv_file[['id_type', 'id_document', 'due_date', 'to_pay']].apply(lambda s: s.to_dict(), axis=1)
>>> csv_file[['account', 'pending_documents']].to_json("data-out-pd.json", orient="records", lines=True)
>>>

The problem here is that I know really little about pandas, and the documentation is huge. This kind of data-wrangling task is not often on my TO-DO list at work, so it's clear I need to study pandas. Thanks for reading. If any further information is needed, just let me know.

CodePudding user response:

You can use .groupby('account') to work with every account separately.

Within each group, you can take the first row (.iloc[0]) as the main document and the remaining rows (.iloc[1:]) as pending_documents.

Using .to_dict() you can create a dict for each account, and append it to a list to collect all accounts.


Minimal working code with example data.

I use io to create the file in memory, so everyone can simply copy and test it, but you can use your own filename.

I use engine='python' so I can use a regex in the separator, sep=r'\s+\|\s*', which also removes all the spaces around values.

data = '''account | customer_name | due_date | id_type | id_document | to_pay |
2001     | John Doe      | 05082022 | FC      | 987654123   | 113,25 |
2002     | Karl Steven   | 05082022 | FC      | 987654124   | 113,25 |
2003     | Sarah Miller  | 15022022 | FC      | 987654125   | 199,75 |
2004     | Marty Mcfly   | 15032022 | FC      | 987654126   | 113,25 |
2001     | John Doe      | 06032022 | FC      | 987654999   | 150,00 |
2001     | John Doe      | 15042022 | FC      | 987654777   | 15,00  |
2003     | Sarah Miller  | 05082022 | FC      | 987444123   | 113,25 |
2001     | John Doe      | 15032022 | FC      | 987654123   | 113,25 |'''

import pandas as pd
import io

#df = pd.read_csv('filename.csv')
df = pd.read_csv(io.StringIO(data), sep=r'\s+\|\s*', engine='python')
#print(df)
df = df.drop('Unnamed: 6', axis=1)  # remove column after last `|`
#print(df)

all_accounts = []

for selected, group in df.groupby('account'):
    #print('---', selected, '---')
    account = group.iloc[0].to_dict()
    docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
    account["pending_documents"] = docs.to_dict(orient='records')
    #print(account)
    all_accounts.append(account)

import json

print(json.dumps(all_accounts, indent=2))

Result:

[
  {
    "account": 2001,
    "customer_name": "John Doe",
    "due_date": 5082022,
    "id_type": "FC",
    "id_document": 987654123,
    "to_pay": "113,25",
    "pending_documents": [
      {
        "id_type": "FC",
        "id_document": 987654999,
        "to_pay": "150,00",
        "due_date": 6032022
      },
      {
        "id_type": "FC",
        "id_document": 987654777,
        "to_pay": "15,00",
        "due_date": 15042022
      },
      {
        "id_type": "FC",
        "id_document": 987654123,
        "to_pay": "113,25",
        "due_date": 15032022
      }
    ]
  },
  {
    "account": 2002,
    "customer_name": "Karl Steven",
    "due_date": 5082022,
    "id_type": "FC",
    "id_document": 987654124,
    "to_pay": "113,25",
    "pending_documents": []
  },
  {
    "account": 2003,
    "customer_name": "Sarah Miller",
    "due_date": 15022022,
    "id_type": "FC",
    "id_document": 987654125,
    "to_pay": "199,75",
    "pending_documents": [
      {
        "id_type": "FC",
        "id_document": 987444123,
        "to_pay": "113,25",
        "due_date": 5082022
      }
    ]
  },
  {
    "account": 2004,
    "customer_name": "Marty Mcfly",
    "due_date": 15032022,
    "id_type": "FC",
    "id_document": 987654126,
    "to_pay": "113,25",
    "pending_documents": []
  }
]
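Note that this result differs from the desired output in two details: reading without dtype=str turns accounts, ids, and dates into integers (losing the leading zero in 05082022), and due_date is not reformatted to DD-MM-YYYY. A minimal sketch that closes the gap, reusing the data string above and keeping the assumption that the current-day document is the first row of each group (the output filename is illustrative):

import io
import json

import pandas as pd

# dtype=str keeps account numbers, ids, and amounts exactly as written
df = pd.read_csv(io.StringIO(data), sep=r'\s+\|\s*', engine='python', dtype=str)
df = df.drop('Unnamed: 6', axis=1)  # remove column after last `|`

# DDMMYYYY -> DD-MM-YYYY, matching the desired output
df['due_date'] = pd.to_datetime(df['due_date'], format='%d%m%Y').dt.strftime('%d-%m-%Y')

with open('data-out.json', 'w', encoding='utf-8') as out_file:
    for selected, group in df.groupby('account'):
        account = group.iloc[0].to_dict()
        docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
        account['pending_documents'] = docs.to_dict(orient='records')
        out_file.write(json.dumps(account) + '\n')  # one JSON object per line

Writing one json.dumps(...) per line produces the multi-line JSON file the question asks for, without holding the serialized output in memory.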