I have a JSON dataset that I'd like to use for an ML project. The comma (,) is missing after each record, so I'm not able to process it using pandas. What could I do to correct the format of the file? The dataset is at https://www.kaggle.com/datasets/rmisra/news-category-dataset
CodePudding user response:
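Each line of the file is its own JSON object (the JSON Lines layout), so no commas are actually missing. You can parse the lines one at a time and put the results in a list to form a DataFrame: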
import json
import pandas as pd

# Each line holds one complete JSON object, so parse the file line by line
with open('News_Category_Dataset_v2.json', 'r') as f:
    df = pd.DataFrame([json.loads(line) for line in f])

print(df)
Output:
             category  ...        date
0               CRIME  ...  2018-05-26
1       ENTERTAINMENT  ...  2018-05-26
2       ENTERTAINMENT  ...  2018-05-26
3       ENTERTAINMENT  ...  2018-05-26
4       ENTERTAINMENT  ...  2018-05-26
...               ...  ...         ...
200848           TECH  ...  2012-01-28
200849         SPORTS  ...  2012-01-28
200850         SPORTS  ...  2012-01-28
200851         SPORTS  ...  2012-01-28
200852         SPORTS  ...  2012-01-28

[200853 rows x 6 columns]
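
If you really do want to correct the file itself, as the question asks, one option is to rewrite it as a single JSON array. The following is only a minimal sketch using the standard library: the helper name `jsonl_to_json_array` and the output filename used below are hypothetical, and it assumes the whole file fits in memory and is UTF-8 encoded.

```python
import json

def jsonl_to_json_array(src_path, dst_path):
    """Rewrite a one-object-per-line file as a single valid JSON array.

    Hypothetical helper for illustration; returns the number of records written.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        # Skip blank lines so a trailing empty line does not break json.loads
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        # json.dump adds the brackets and the commas between records
        json.dump(records, dst)
    return len(records)
```

Calling `jsonl_to_json_array('News_Category_Dataset_v2.json', 'news_array.json')` should produce a file that `pd.read_json('news_array.json')` can load. If you only need the DataFrame, converting is not necessary: `pd.read_json('News_Category_Dataset_v2.json', lines=True)` reads the one-object-per-line layout directly.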