How to create a nested json with inner logic, using a csv file as source and pandas?


Explanation: A CSV file needs to be converted to a JSON file. The CSV holds the following information: each account must pay 1 to n documents, one due on the current day and the others with earlier due dates. The CSV file has the following structure:

account | customer_name | due_date | id_type | id_document | to_pay |
2001     | John Doe      | 05082022 | FC      | 987654123   | 113,25 |
2002     | Karl Steven   | 05082022 | FC      | 987654124   | 113,25 |
2003     | Sarah Miller  | 15022022 | FC      | 987654125   | 199,75 |
2004     | Marty Mcfly   | 15032022 | FC      | 987654126   | 113,25 |
2001     | John Doe      | 06032022 | FC      | 987654999   | 150,00 |
2001     | John Doe      | 15042022 | FC      | 987654777   | 15,00  |
2003     | Sarah Miller  | 05082022 | FC      | 987444123   | 113,25 |
2001     | John Doe      | 15032022 | FC      | 987654123   | 113,25 |

Desired output: The output must be a multi-line JSON file (one JSON object per line) that represents the logic mentioned above. Taking just one line of the JSON, pretty-printed, it must look like this:

{
    "account": "2001",
    "customer_name": "John Doe",
    "due_date" : "05-08-2022",
    "id_type": "FC",
    "id_document": "987654123",
    "to_pay": "113,25",
    "pending_documents":[
        {
            "id_type" : "FC",
            "id_document" : "987654999",
            "to_pay" : "150,00",
            "due_date": "06-03-2022"
        },
        {
            "id_type" : "FC",
            "id_document" : "987654777",
            "to_pay" : "15,00",
            "due_date": "15-04-2022"
        },
        {
            "id_type" : "FC",
            "id_document" : "987654123",
            "to_pay" : "113,25",
            "due_date": "15-03-2022"
        }
    ]
}

What I have tried: Using just Python (this piece of code is not functional as shown, but it illustrates the approach concretely):

import codecs
import json
import re
from datetime import datetime

# `pattern` matches the lines that hold a current-day document; its
# definition (and `input_file`) are omitted here, as in the original.
output = []
cuentas_cto = []

with codecs.open(input_file, 'r', encoding='utf-8', errors='ignore') as in_file:
    for line in in_file:
        if pattern.search(line):
            line_object = {'pending_documents': []}
            line_object['account'] = line[2:10]
            cuentas_cto.append(line_object['account'])
            line_object['due_date'] = datetime.strptime(line[90:98], '%d%m%Y').strftime('%Y-%m-%d')
            line_object['id_document'] = line[105:113]
            # rescan the whole file for the other documents of this account
            pattern_cuenta = re.compile(line_object['account'])
            with codecs.open(input_file, 'r', encoding='utf-8', errors='ignore') as file_reopen:
                for inner_line in file_reopen:
                    if pattern_cuenta.search(inner_line) and not pattern.search(inner_line):
                        inner_object = {
                            'due_date': datetime.strptime(inner_line[90:98], '%d%m%Y').strftime('%Y-%m-%d'),
                            'id_document': inner_line[105:113],
                        }
                        line_object['pending_documents'].append(inner_object)
            output.append(json.dumps(line_object))

This approach is far from elegant, and while it works, it takes a massive amount of time to complete. That makes it unusable here: each CSV file is around 150 MB with roughly 1,000,000 lines, and the snippet reopens and rescans the whole file once per matching line, so the runtime grows quadratically with the number of lines. A single pass that groups rows in a dict keyed by account avoids that rescan; see the sketch below.
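As a baseline before reaching for pandas, here is a minimal single-pass sketch in plain Python. It assumes the pipe-separated layout and column names shown above (the real source may be fixed-width, as the slicing in the snippet suggests), that the current-day document is the first row seen for each account, and illustrative file names:

import csv
import json

# account -> record with nested pending_documents, built in one pass
accounts = {}

with open('inCSV.csv', encoding='utf-8', newline='') as in_file:
    reader = csv.DictReader(in_file, delimiter='|')
    for raw_row in reader:
        # strip the padding spaces around headers and values,
        # and drop the empty column created by the trailing '|'
        row = {k.strip(): v.strip() for k, v in raw_row.items() if k and k.strip()}
        doc = {
            'id_type': row['id_type'],
            'id_document': row['id_document'],
            'to_pay': row['to_pay'],
            'due_date': row['due_date'],
        }
        if row['account'] not in accounts:
            # the first document seen for an account becomes the main record
            accounts[row['account']] = {
                'account': row['account'],
                'customer_name': row['customer_name'],
                **doc,
                'pending_documents': [],
            }
        else:
            accounts[row['account']]['pending_documents'].append(doc)

with open('data-out.json', 'w', encoding='utf-8') as out_file:
    for record in accounts.values():
        out_file.write(json.dumps(record) + '\n')  # one JSON object per line

Because it reads the file exactly once and does dict lookups instead of rescans, this is linear in the number of lines.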

Why pandas: It is really fast at loading and processing data, so it looks promising. Here is a really small POC:

>>> import pandas as pd
>>> csv_file = pd.read_csv("inCSV.csv", sep="|", header=0, dtype=str)
>>> csv_file['pending_documents'] = csv_file[['id_type', 'id_document', 'due_date', 'to_pay']].apply(lambda s: s.to_dict(), axis=1)
>>> csv_file[['account', 'pending_documents']].to_json("data-out-pd.json", orient="records", lines=True)
>>>

The problem here is that I know really little about pandas, and the documentation is huge. This kind of data-wrangling task is not often on my TO-DO list at work, so it's clear I need to study pandas. Thanks for reading. If any further information is needed, just let me know.

CodePudding user response:

You can use .groupby('account') to work with every account separately.

Within each group, you can take the first row (.iloc[0]) as the main document and the remaining rows (.iloc[1:]) as pending_documents.

Using .to_dict() you can create a dict for each account, and append it to a list to collect all accounts.


Minimal working code with example data.

I use io to create the file in memory, so everyone can simply copy and test it, but you can use your own filename.

I use engine='python' so I can use a regex in the separator, sep=r'\s+\|\s*', which also removes all the spaces around values.

data = '''account | customer_name | due_date | id_type | id_document | to_pay |
2001     | John Doe      | 05082022 | FC      | 987654123   | 113,25 |
2002     | Karl Steven   | 05082022 | FC      | 987654124   | 113,25 |
2003     | Sarah Miller  | 15022022 | FC      | 987654125   | 199,75 |
2004     | Marty Mcfly   | 15032022 | FC      | 987654126   | 113,25 |
2001     | John Doe      | 06032022 | FC      | 987654999   | 150,00 |
2001     | John Doe      | 15042022 | FC      | 987654777   | 15,00  |
2003     | Sarah Miller  | 05082022 | FC      | 987444123   | 113,25 |
2001     | John Doe      | 15032022 | FC      | 987654123   | 113,25 |'''

import pandas as pd
import io

#df = pd.read_csv('filename.csv')
df = pd.read_csv(io.StringIO(data), sep=r'\s+\|\s*', engine='python')
#print(df)
df = df.drop('Unnamed: 6', axis=1)  # remove column after last `|`
#print(df)

all_accounts = []

for selected, group in df.groupby('account'):
    #print('---', selected, '---')
    account = group.iloc[0].to_dict()
    docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
    account["pending_documents"] = docs.to_dict(orient='records')
    #print(account)
    all_accounts.append(account)

import json

print(json.dumps(all_accounts, indent=2))

Result:

[
  {
    "account": 2001,
    "customer_name": "John Doe",
    "due_date": 5082022,
    "id_type": "FC",
    "id_document": 987654123,
    "to_pay": "113,25",
    "pending_documents": [
      {
        "id_type": "FC",
        "id_document": 987654999,
        "to_pay": "150,00",
        "due_date": 6032022
      },
      {
        "id_type": "FC",
        "id_document": 987654777,
        "to_pay": "15,00",
        "due_date": 15042022
      },
      {
        "id_type": "FC",
        "id_document": 987654123,
        "to_pay": "113,25",
        "due_date": 15032022
      }
    ]
  },
  {
    "account": 2002,
    "customer_name": "Karl Steven",
    "due_date": 5082022,
    "id_type": "FC",
    "id_document": 987654124,
    "to_pay": "113,25",
    "pending_documents": []
  },
  {
    "account": 2003,
    "customer_name": "Sarah Miller",
    "due_date": 15022022,
    "id_type": "FC",
    "id_document": 987654125,
    "to_pay": "199,75",
    "pending_documents": [
      {
        "id_type": "FC",
        "id_document": 987444123,
        "to_pay": "113,25",
        "due_date": 5082022
      }
    ]
  },
  {
    "account": 2004,
    "customer_name": "Marty Mcfly",
    "due_date": 15032022,
    "id_type": "FC",
    "id_document": 987654126,
    "to_pay": "113,25",
    "pending_documents": []
  }
]
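Note that this result differs from the desired output in two details: reading without dtype=str turns accounts, ids, and dates into integers (losing the leading zero in 05082022), and due_date is not reformatted to DD-MM-YYYY. A minimal sketch that closes the gap, reusing the data string above and keeping the assumption that the current-day document is the first row of each group (the output filename is illustrative):

import io
import json

import pandas as pd

# dtype=str keeps account numbers, ids, and amounts exactly as written
df = pd.read_csv(io.StringIO(data), sep=r'\s+\|\s*', engine='python', dtype=str)
df = df.drop('Unnamed: 6', axis=1)  # remove column after last `|`

# DDMMYYYY -> DD-MM-YYYY, matching the desired output
df['due_date'] = pd.to_datetime(df['due_date'], format='%d%m%Y').dt.strftime('%d-%m-%Y')

with open('data-out.json', 'w', encoding='utf-8') as out_file:
    for selected, group in df.groupby('account'):
        account = group.iloc[0].to_dict()
        docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
        account['pending_documents'] = docs.to_dict(orient='records')
        out_file.write(json.dumps(account) + '\n')  # one JSON object per line

Writing one json.dumps(...) per line produces the multi-line JSON file the question asks for, without holding the serialized output in memory.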