Pandas read_json : skip first lines of the file

Time:10-15

Say I have a JSON file with lines of data like this:

file.json:

{"ID": "098656", "query": "query_file.txt"}

{"A": 1, "B": 2}
{"A": 3, "B": 6}
{"A": 0, "B": 4}
...

where the first line only describes the file and how it was created. I would like to open it with something like:

import pandas as pd
df = pd.read_json('file.json', lines=True)

However, how do I read the data starting on line 3? I know that pd.read_csv has a skiprows argument, but pd.read_json does not appear to have one.

I would like something that returns a DataFrame with only the columns A and B, ideally without loading the whole file and then dropping the first row and the ID and query columns.

CodePudding user response:

You can read the lines of the file yourself, skip the first n of them, and pass the rest to pandas:

import json
import pandas as pd

# Read all lines and drop the first two (the description line and the blank line after it).
with open('file.json') as f:
    lines = f.read().splitlines()[2:]

# Parse each remaining line as JSON and build the DataFrame from the resulting dicts.
df = pd.json_normalize([json.loads(line) for line in lines])
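If the file is large or may contain further blank lines, the same idea can be written as a generator that never holds the whole file in memory. This is only a sketch: the helper name iter_json_records and its skiprows parameter are made up for illustration, and it assumes every non-blank line after the skipped ones is valid JSON.

```python
import json
from itertools import islice


def iter_json_records(path, skiprows=0):
    """Yield one dict per non-blank line of `path`, ignoring the first `skiprows` lines.

    Hypothetical helper, not part of pandas.
    """
    with open(path, encoding="utf-8") as f:
        # islice skips exactly `skiprows` lines, then iterates over the rest lazily.
        for line in islice(f, skiprows, None):
            line = line.strip()
            if line:  # tolerate blank separator lines
                yield json.loads(line)
```

The records can then be handed to pandas, for example with pd.DataFrame(list(iter_json_records('file.json', skiprows=1))).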

CodePudding user response:

pandas.read_json also accepts an open file handle. If some lines have already been consumed from the handle, only the rest of the file is converted to a DataFrame.

import pandas as pd

def read_json(file, skiprows):
    with open(file) as f:
        # Advance the handle past the first `skiprows` lines.
        for _ in range(skiprows):
            next(f, None)
        df = pd.read_json(f, lines=True)
    return df
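
One detail worth checking when skipping lines on a handle: the argument of readlines() is a size hint, not a line count, so f.readlines(2) generally does not consume two lines. The sketch below uses an in-memory io.StringIO as a stand-in for the file (the sample text and variable names are made up for illustration) and compares it with itertools.islice, which does take a line count.

```python
import io
from itertools import islice

text = '{"ID": "098656"}\n\n{"A": 1, "B": 2}\n{"A": 3, "B": 6}\n'

# readlines(2): reading stops as soon as more than 2 characters have been
# consumed, which here happens after the first line alone.
f = io.StringIO(text)
consumed_by_hint = f.readlines(2)
rest_after_hint = f.read()

# islice(g, 2) pulls exactly two lines, leaving the handle at line 3.
g = io.StringIO(text)
consumed_by_islice = list(islice(g, 2))
rest_after_islice = g.read()

print(len(consumed_by_hint), len(consumed_by_islice))
```

With readlines the blank second line is still left in the handle, whereas islice leaves only the data lines to be parsed.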