Converting a text document into a JSONL (JSON Lines) format

I want to convert a text file into a json lines format using Python. I need this to be applicable to a text file of any length (in characters or words).

As an example, I want to convert the following text:

A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so. 

These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.

To this:

{"text": "A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so."}
{"text": "These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics."}

I tried this:

text = ""
with open("text.txt", encoding="utf8") as f:
    for line in f:
        text = {"text": line}

But no luck.

CodePudding user response：

A hacky way of doing this is to paste the text file into a CSV, one paragraph per row. Make sure to write "text" in the first cell so that it becomes the column header, then use this code:

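import pandas as pd

# Placeholder paths; point these at your own files
knowledge = "knowledge.csv"
knowledge_jsonl = "knowledge.jsonl"

df = pd.read_csv(knowledge)
df.to_json(knowledge_jsonl,
           orient="records",
           lines=True)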

Not ideal but it works.

CodePudding user response：

The basic idea of your for loop was correct but the line text = {"text": line} is just overwriting the previous line every time, whereas what you want is to generate a list of lines.

Try the following:

import json

# Generate a list of dictionaries
lines = []
with open("text.txt", encoding="utf8") as f:
    for line in f.read().splitlines():
        if line:
            lines.append({"text": line})

# Convert to a list of JSON strings
json_lines = [json.dumps(entry) for entry in lines]

# Join lines and save to .jsonl file
json_data = '\n'.join(json_lines)
with open('my_file.jsonl', 'w', encoding="utf8") as f:
    f.write(json_data)

splitlines() removes the \n characters, and the if line: check skips blank lines.
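
Since the question asks for something that works on a text file of any length, here is a minimal sketch of a streaming variant that writes each JSON line as soon as it is read, instead of building the whole list in memory first. The function name text_to_jsonl and its two path parameters are my own, not from the question. It also strips surrounding whitespace from each line (the sample text has a trailing space after "rightfully so."), which is an assumption about what you want.

```python
import json


def text_to_jsonl(src_path, dst_path):
    """Write one {"text": ...} JSON object per non-blank line of src_path.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the number of JSON lines written.
    """
    count = 0
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            # Drop the newline and any surrounding whitespace
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # ensure_ascii=False keeps non-ASCII characters readable in the output
            dst.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling text_to_jsonl("text.txt", "my_file.jsonl") should then produce the same kind of output as the code above, with one JSON object per non-blank line and a trailing newline after the last one.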