I am desperately trying to convert a nested JSON feature within a CSV into data frame rows. Could you help?
Sample CSV row
2021-09-26T08:25:43.021051958Z,"{""level"":""info"",""message"":""Success (Cached)"",""request"":""GET /api/v1/settingsid=3"",""httpCode"":200,""service"":""stats-vis-backend"",""timestamp"":""2021-09-26 08:25:43""}",ip-10-xxx-xxx-18.eu-central-1.compute.internal,podname-75ffdf6b-gns8v
Desired output (using JSON part only):
id | message | request | httpCode | service | timestamp |
---|---|---|---|---|---|
0 | Success (Cached) | GET /api/v1/settings?id=3 | 200 | stats-vis-backend | 2021-09-26 08:25:43 |
If this would be the data frame output structure, I would be more than happy. I tried JSON normalize etc., but I am far away from a solution.
Thanks so much!!
Best David
Full Code trial (based on SeaBean):
import csv
import ast
import pandas as pd
# read CSV
df = pd.read_csv('/Users/David/xaa.csv',sep=',', header=None)
print(df.head(1))
# convert string of JSON/dict to real JSON/dict
# the JSON/dict is at column `1` (second column from left)
df[1] = df[1].apply(ast.literal_eval)
# Create dataframe from the JSON part
df_json = pd.DataFrame(df[1].tolist())
print(df_json.head(1))
Full output dump
File "/Users/David/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-86e494aa8f0c>", line 12, in <module>
df[1] = df[1].apply(ast.literal_eval)
File "/Users/David/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py", line 4138, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
File "/Users/David/opt/anaconda3/lib/python3.8/ast.py", line 59, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/Users/David/opt/anaconda3/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line 1
> next start
^
SyntaxError: invalid syntax
Sample Output df1
0 {"level":"info","message":"Success (Cached)","...
1 {"level":"info","message":"Success (Cached)","...
2 {"level":"info","message":"Success (Cached)","...
3 {"level":"info","message":"Success","request":...
4 {"level":"info","message":"Success (Cached)","...
...
249995 {"level":"info","message":"Success (Cached)","...
249996 {"level":"info","message":"Success (Cached)","...
249997 {"level":"info","message":"Success (Cached)","...
249998 {"level":"info","message":"Success","request":...
249999 {"level":"info","message":"Success (Cached)","...
Name: 1, Length: 250000, dtype: object
Sample toDict() Output of df1
{0: '{"level":"info","message":"Success (Cached)","request":"GET /api/v1/settings?id=3","httpCode":200,"service":"stats-vis-backend","timestamp":"2021-09-26 08:25:43"}',
1: '{"level":"info","message":"Success (Cached)","request":"GET /api/v1/settings?id=3","httpCode":200,"service":"stats-vis-backend","timestamp":"2021-09-26 08:26:17"}',
Output print(df.iloc[[4480]])
0 1 \
4480 2021-09-26T12:00:58.983344643Z > next start
2 \
4480 ip-10-xxx-xxxx-30.eu-central-1.compute.internal
3
4480 xxxx-converter-75ffxf6b-jq2w7
CodePudding user response:
You can use pd.DataFrame
on the list of column values of the second column (with JSON) after converting the string of JSON to real JSON (not in string), as follows:
# read CSV
df = pd.read_csv(r'mycsv.csv', sep=',', header=None)
# convert string of JSON/dict to real JSON/dict
import ast
# the JSON/dict is at column `1` (second column from left)
df[1] = df[1].apply(ast.literal_eval)
# Create dataframe from the JSON part
df_json = pd.DataFrame(df[1].tolist())
If you have already read the CSV into dataframe with column header, you can also use the column label of the second column instead of 1
for the column label for second column in the codes above.
Result:
print(df_json)
level message request httpCode service timestamp
0 info Success (Cached) GET /api/v1/settingsid=3 200 stats-vis-backend 2021-09-26 08:25:43