Pandas read_json: skip first lines of the file

Say I have a JSON file with lines of data like this:

file.json :

{"ID":"098656", "query":"query_file.txt"}

{"A":1, "B":2}
{"A":3, "B":6}
{"A":0, "B":4}
...

where the first line only describes the file and how it was created. I would like to open it with something like:

import pandas as pd
df = pd.read_json('file.json', lines=True)

However, how do I read the data starting on line 3? I know that pd.read_csv has a skiprows argument, but pd.read_json does not seem to have one.

I would like to get a DataFrame with only the columns A and B, ideally without having to load the whole file and then drop the first row and the ID and query columns.

CodePudding user response：

You can read the lines of the file, skip the first n, and then pass the rest to pandas:

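import json

import pandas as pd

# Read all lines and drop the first two (the description line and the blank line after it)
with open('file.json') as f:
    lines = f.read().splitlines()[2:]

# Keep the raw JSON strings in a single-column DataFrame
df_tmp = pd.DataFrame(lines, columns=['json_data'])

# Parse each string into a dict and expand the keys into columns
df = pd.json_normalize(df_tmp['json_data'].apply(json.loads))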
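
If the file is large, a variation on the same idea avoids holding every raw line in memory at once: skip the first lines while iterating over the file, and parse the rest one line at a time. This is only a minimal sketch; it assumes each data line is valid JSON (double-quoted keys) and that blank lines can be ignored, and the helper name iter_records is made up for illustration:

```python
import json
from itertools import islice


def iter_records(path, skiprows):
    """Yield one dict per JSON line, ignoring the first `skiprows` lines."""
    with open(path) as f:
        # islice drops the first `skiprows` lines lazily, one line at a time
        for line in islice(f, skiprows, None):
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

The records can then be handed to pandas, for example with pd.DataFrame(list(iter_records('file.json', 2))).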

CodePudding user response：

pandas.read_json also accepts an open file object. If we consume the first lines from it beforehand, only the rest of the file is converted to a DataFrame.

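import pandas as pd

def read_json(file, skiprows):
    with open(file) as f:
        # readlines(n) takes a size hint, not a line count, so skip line by line
        for _ in range(skiprows):
            f.readline()
        df = pd.read_json(f, lines=True)
    return df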
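
To try this approach end to end, here is a small self-contained sketch that writes the sample data from the question to a temporary file and loads it while skipping the first two lines. It consumes the skipped lines with readline() in a loop, because the argument of readlines() is a size hint rather than a line count. The function name read_json_skipping is hypothetical, and the sample content uses double quotes on the assumption that the real file is valid JSON:

```python
import os
import tempfile

import pandas as pd


def read_json_skipping(path, skiprows):
    """Load a JSON Lines file, ignoring its first `skiprows` lines."""
    with open(path) as f:
        for _ in range(skiprows):
            f.readline()  # consume exactly one line per iteration
        # read_json parses whatever is left in the open file object
        return pd.read_json(f, lines=True)


# Sample file from the question, written as valid (double-quoted) JSON
sample = (
    '{"ID": "098656", "query": "query_file.txt"}\n'
    '\n'
    '{"A": 1, "B": 2}\n'
    '{"A": 3, "B": 6}\n'
    '{"A": 0, "B": 4}\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write(sample)

df = read_json_skipping(tmp.name, skiprows=2)
os.remove(tmp.name)
print(df)
```

The resulting DataFrame should contain only the columns A and B, since the description line is never seen by pandas.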