I have a text file with the following format.
Order|AA|BB|CC|DD
2|status1|Cdd.int|true|false
12|status2|Cdd.String|true|false
1|status3|Cdd.Float|false|true
I would like to read this text file, keep only the columns Order, AA, BB, and CC, add a metadata field with an empty value to each record, sort the records by Order, and then parse the result into JSON.
The expected output looks like the following.
{
"fields": [
{
"metadata": {},
"name" : "status3",
"type" : "Float",
"nullable" : false
},
{
"metadata": {},
"name" : "status1",
"type" : "int",
"nullable" : true
},
{
"metadata": {},
"name" : "status2",
"type" : "String",
"nullable" : true
}
],
"type": "struct"
}
Can anyone help with this?
CodePudding user response:
Assuming the data is stored in a text file file.txt as follows:
Order|AA|BB|CC|DD
2|status1|Cdd.int|true|false
12|status2|Cdd.String|true|false
1|status3|Cdd.Float|false|true
The following code does what you need (explained in the inline comments):
import pandas as pd
import json

# read the pipe-delimited file
df = pd.read_csv("file.txt", sep="|")

headers = {"AA": "name",
           "BB": "type",
           "CC": "nullable"}

# sort by Order before that column is dropped
df.sort_values("Order", inplace=True)

# drop columns which are not in the headers dict
df.drop([c for c in df.columns if c not in headers], inplace=True, axis=1)

# rename columns based on the headers dict
df.rename(columns=headers, inplace=True)

# strip the "Cdd." prefix from the type column
df["type"] = df["type"].str.split(".").str[1]

# build the final dict
output = {"fields": [], "type": "struct"}
for _, row in df.iterrows():
    data_dict = {"metadata": {}}
    data_dict.update(row.to_dict())
    output["fields"].append(data_dict)

# save as JSON
with open("output.json", "w") as f:
    json.dump(output, f, indent=4)
The output JSON (output.json) is as follows:
{
    "fields": [
        {
            "metadata": {},
            "name": "status3",
            "type": "Float",
            "nullable": false
        },
        {
            "metadata": {},
            "name": "status1",
            "type": "int",
            "nullable": true
        },
        {
            "metadata": {},
            "name": "status2",
            "type": "String",
            "nullable": true
        }
    ],
    "type": "struct"
}
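As a side note, the per-row loop can also be replaced with pandas' DataFrame.to_dict("records"). Here is a minimal self-contained sketch of the same transformation (the sample data is inlined via StringIO so the snippet runs on its own):

```python
import json
from io import StringIO

import pandas as pd

# inline the sample data so the sketch is self-contained
data = StringIO(
    "Order|AA|BB|CC|DD\n"
    "2|status1|Cdd.int|true|false\n"
    "12|status2|Cdd.String|true|false\n"
    "1|status3|Cdd.Float|false|true\n"
)

df = pd.read_csv(data, sep="|")
df = df.sort_values("Order").rename(columns={"AA": "name", "BB": "type", "CC": "nullable"})
# strip the "Cdd." prefix from the type column
df["type"] = df["type"].str.split(".").str[1]

# to_dict("records") yields one dict per row; prepend the empty metadata
fields = [
    {"metadata": {}, "name": rec["name"], "type": rec["type"], "nullable": bool(rec["nullable"])}
    for rec in df.to_dict("records")
]
output = {"fields": fields, "type": "struct"}
print(json.dumps(output, indent=4))
```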
Hope it helps!
CodePudding user response:
Here is a solution without the third-party dependency pandas. I would also say the code is easier to read and less complex.
The ordering is done in an intermediate step: each CSV line/row is stored as a dict, which itself sits in a parent dict indexed by the (cast-to-int) value of Order. This parent dict is then sorted by its keys (the Order values) and transformed into a list.
#!/usr/bin/env python3
import csv
import json

# simulate a csv file
csv_string = '''Order|AA|BB|CC|DD
2|status1|Cdd.int|true|false
12|status2|Cdd.String|true|false
1|status3|Cdd.Float|false|true
'''
csv_lines = csv_string.split('\n')

# store the rows temporarily, keyed by Order
temp_result = {}

reader = csv.reader(csv_lines, delimiter='|')
next(reader)  # skip the header
for row in reader:
    # ignore empty rows
    if not row:
        continue
    # use 'Order' (cast to int) as the key
    temp_result[int(row[0])] = {
        'metadata': {},
        'name': row[1],
        'type': row[2][4:],
        'nullable': row[3] == 'true'
    }

# sort the temporary result by its keys and transform to a list
fields = [temp_result[key] for key in sorted(temp_result)]

# final result
result = {
    'fields': fields,
    'type': 'struct'
}

# show the result
print(json.dumps(result, indent=4))
In your real production code, replace csv_string with a file-like object:

import pathlib

csv_file_path = pathlib.Path('data.csv')
with csv_file_path.open('r', newline='') as file_handle:
    reader = csv.reader(file_handle, delimiter='|')
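Put together, an end-to-end sketch reading from an actual file could look like this (the file name data.csv is an assumption; the sample file is written first so the snippet runs on its own):

```python
import csv
import json
import pathlib

# write the sample data first so the sketch is self-contained
csv_file_path = pathlib.Path('data.csv')
csv_file_path.write_text(
    'Order|AA|BB|CC|DD\n'
    '2|status1|Cdd.int|true|false\n'
    '12|status2|Cdd.String|true|false\n'
    '1|status3|Cdd.Float|false|true\n'
)

# parse the file, keyed by the cast-to-int Order value
temp_result = {}
with csv_file_path.open('r', newline='') as file_handle:
    reader = csv.reader(file_handle, delimiter='|')
    next(reader)  # skip the header
    for row in reader:
        if not row:
            continue
        temp_result[int(row[0])] = {
            'metadata': {},
            'name': row[1],
            'type': row[2].split('.', 1)[1],
            'nullable': row[3] == 'true',
        }

# sort by Order and assemble the final structure
result = {
    'fields': [temp_result[key] for key in sorted(temp_result)],
    'type': 'struct',
}
print(json.dumps(result, indent=4))
```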
Advice about the result
Your resulting data structure looks a bit weird and unpythonic. I would suggest removing the name element from each dict and using it as the dict key instead.
{
"fields": {
"status1": {
"metadata": {},
"type": "int",
"nullable": true
},
"status2": {
"metadata": {},
"type": "String",
"nullable": true
},
"status3": {
"metadata": {},
"type": "Float",
"nullable": false
}
},
"type": "struct"
}
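One way to build that name-keyed variant is a small dict comprehension that pops the name out of each entry. A sketch, starting from an in-memory fields list as produced by the code above:

```python
# a fields list as produced by the parsing code above
fields = [
    {'metadata': {}, 'name': 'status3', 'type': 'Float', 'nullable': False},
    {'metadata': {}, 'name': 'status1', 'type': 'int', 'nullable': True},
    {'metadata': {}, 'name': 'status2', 'type': 'String', 'nullable': True},
]

# pop the 'name' element out of each dict and use it as the key
result = {
    'fields': {f.pop('name'): f for f in fields},
    'type': 'struct',
}
```

Note that dicts preserve insertion order in Python 3.7+, so the Order-based sorting of the list carries over to the keyed structure.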