Explanation: From a csv file it's needed convert it to a json file. The csv holds the following information: given an account this account must to pay 1 or n documents, one with the due date the current day and the others with dates before. The csv file has the following structure:
account | customer_name | due_date | id_type | id_document | to_pay |
---|---|---|---|---|---|
2001 | John Doe | 05082022 | FC | 987654123 | 113,25 |
2002 | Karl Steven | 05082022 | FC | 987654124 | 113,25 |
2003 | Sarah Miller | 15022022 | FC | 987654125 | 199,75 |
2004 | Marty Mcfly | 15032022 | FC | 987654126 | 113,25 |
2001 | John Doe | 06032022 | FC | 987654999 | 150,00 |
2001 | John Doe | 15042022 | FC | 987654777 | 15,00 |
2003 | Sarah Miller | 05082022 | FC | 987444123 | 113,25 |
2001 | John Doe | 15032022 | FC | 987654123 | 113,25 |
account | customer_name | due_date | id_type | id_document | to_pay |
2001 | John Doe | 05082022 | FC | 987654123 | 113,25 |
2002 | Karl Steven | 05082022 | FC | 987654124 | 113,25 |
2003 | Sarah Miller | 15022022 | FC | 987654125 | 199,75 |
2004 | Marty Mcfly | 15032022 | FC | 987654126 | 113,25 |
2001 | John Doe | 06032022 | FC | 987654999 | 150,00 |
2001 | John Doe | 15042022 | FC | 987654777 | 15,00 |
2003 | Sarah Miller | 05082022 | FC | 987444123 | 113,25 |
2001 | John Doe | 15032022 | FC | 987654123 | 113,25 |
Desire Output: The output must be a multi-line json file that represent the logic mentioned above so taking just one line of the json, and with a pretty formatting, must look like this:
{
"account": "2001",
"customer_name": "John Doe",
"due_date" : "05-08-2022",
"id_type": "FC",
"id_document": "987654123",
"to_pay": "113,25",
"pending_documents":[
{
"id_type" : "FC",
"id_document" : "987654999",
"to_pay" : "150,00",
"due_date": "06-03-2022"
},
{
"id_type" : "FC",
"id_document" : "987654777",
"to_pay" : "15,00",
"due_date": "15-04-2022"
},
{
"id_type" : "FC",
"id_document" : "987654123",
"to_pay" : "113,25",
"due_date": "15-03-2022"
}
]
}
What Have I tried: Using just python: (This piece of code it's not functional, but it's for the sake of a concrete approach)
with codecs.open(input_file, 'r', encoding='utf-8', errors='ignore') as in_file:
for line in in_file:
if pattern.search(line):
line_object['account'] = line[2:10]
cuentas_cto.append(line_object['cuenta'])
line_object['due_date'] = datetime.strptime(line[90:98],'%d%m%Y').strftime('%Y-%m-%d')
line_object['id_document'] = line[105:113]
pattern_cuenta = re.compile(line_object['cuenta'])
file_reopen = codecs.open(input_file, 'r', encoding='utf-8', errors='ignore')
lines_file_reopen = file_reopen.readlines()
for line in lines_file_reopen:
if pattern_cuenta.search(line) and not pattern.search(line):
inner_object = {
"due_date" : "",
"id_document" : ""
}
inner_object['due_date'] = datetime.strptime(line[90:98],'%d%m%Y').strftime('%Y-%m-%d')
inner_object['id_document'] = line[105:113]
line_object['pending_documents'].append(inner_object)
file_reopen.close()
output.append(json.dumps(line_object))
This approach it's far from being a beauty solution but works. (works but takes a massive amount of time to complete). This is not a valid solution because each csv file has a size of 150 MB with around 1 000 000 lines each.
Why pandas: It's really fast for loading and processing so it's promising. Here is a really small POC:
>>> import pandas as pd
>>> df = pd.DataFrame(pd.read_csv("inCSV.csv", sep="|", header=0, dtype=str))
>>> df = pd.read_csv("inCSV.csv", sep="|", header=0, dtype=str)
>>> csv_file = pd.read_csv("inCSV.csv", sep="|", header=0, dtype=str)
>>> csv_file['pending_documents'] = csv_file[['id_document', 'id_document', 'due_date', 'doc_importe']].apply(lambda s: s.to_dict(), axis=1)
>>> csv_file[['account', 'pending_documents']].to_json("data-out-pd.json", orient="records")
>>> csv_file[['account', 'pending_documents']].to_json("data-out-pd.json", orient="records", lines=True)
>>>
Here the problem is that I know really little about pandas and the documentation it's huge. Doing this task of data treatment is not always in my TO-DO list at work. It's clear that I need to study pandas. Thanks for reading. Any further information needed just let me know.
CodePudding user response:
You can use .groupby('account')
to work with every account separatelly.
And when you will have one account then you can get first row ([0]
) as main document, and other rows ([1:]
) as pending_documents
.
Using .to_dict()
you can create dict for account - and later you can append it to some list to have all accounts.
Minimal working code with example data.
I use io
to create file in memory - so everyone can simply copy and test it - but you can use your filename
.
I use engine='python'
so I can use regex
in separator sep='\s \|\s*'
and remove all spaces around values.
data = '''account | customer_name | due_date | id_type | id_document | to_pay |
2001 | John Doe | 05082022 | FC | 987654123 | 113,25 |
2002 | Karl Steven | 05082022 | FC | 987654124 | 113,25 |
2003 | Sarah Miller | 15022022 | FC | 987654125 | 199,75 |
2004 | Marty Mcfly | 15032022 | FC | 987654126 | 113,25 |
2001 | John Doe | 06032022 | FC | 987654999 | 150,00 |
2001 | John Doe | 15042022 | FC | 987654777 | 15,00 |
2003 | Sarah Miller | 05082022 | FC | 987444123 | 113,25 |
2001 | John Doe | 15032022 | FC | 987654123 | 113,25 |'''
import pandas as pd
import io
#df = pd.read_csv('filename.csv')
df = pd.read_csv(io.StringIO(data), sep='\s \|\s*', engine='python')
#print(df)
df = df.drop('Unnamed: 6', axis=1) # remove column after last `|`
#print(df)
all_accounts = []
for selected, group in df.groupby(['account']):
#print('---', selected, '---')
account = group.iloc[0].to_dict()
docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
account["pending_documents"] = docs.to_dict(orient='records')
#print(account)
all_accounts.append(account)
import json
print(json.dumps(all_accounts, indent=2))
Result:
[
{
"account": 2001,
"customer_name": "John Doe",
"due_date": 5082022,
"id_type": "FC",
"id_document": 987654123,
"to_pay": "113,25",
"pending_documents": [
{
"id_type": "FC",
"id_document": 987654999,
"to_pay": "150,00",
"due_date": 6032022
},
{
"id_type": "FC",
"id_document": 987654777,
"to_pay": "15,00",
"due_date": 15042022
},
{
"id_type": "FC",
"id_document": 987654123,
"to_pay": "113,25",
"due_date": 15032022
}
]
},
{
"account": 2002,
"customer_name": "Karl Steven",
"due_date": 5082022,
"id_type": "FC",
"id_document": 987654124,
"to_pay": "113,25",
"pending_documents": []
},
{
"account": 2003,
"customer_name": "Sarah Miller",
"due_date": 15022022,
"id_type": "FC",
"id_document": 987654125,
"to_pay": "199,75",
"pending_documents": [
{
"id_type": "FC",
"id_document": 987444123,
"to_pay": "113,25",
"due_date": 5082022
}
]
},
{
"account": 2004,
"customer_name": "Marty Mcfly",
"due_date": 15032022,
"id_type": "FC",
"id_document": 987654126,
"to_pay": "113,25",
"pending_documents": []
}
]