How to handle a txt file using pandas and save its results

I have an input txt file in which each line contains an image name, other metadata that is not needed, and a last column of tokens separated by the | character. Example input (input.txt):

a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from
a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers
a01-000u-02 ok 157 16 408 1106 1986 105 is|to|be|made|at|a|meeting|of|Labour

The expected output is either a file expected_out.txt or a data frame that has only the image name and the text:

a01-000u-00.png A MOVE to stop Mr. Gaitskell from
a01-000u-01.png nominating any more Labour life Peers
a01-000u-02.png is to be made at a meeting of Labour

The script to process the file is below:

import pandas as pd

train_text = 'input.txt'

def load_data() -> pd.DataFrame:
    data = []
    with open(train_text) as infile:
        for line in infile:
            file_name, _, _, _, _, _, _, _, text = line.strip().split(' ')
            data.append((file_name, process_last_column(text)))

    df = pd.DataFrame(data, columns=['file_name', 'text'])
    df.rename(columns={0: 'file_name', 8: 'text'}, inplace=True)
    df['file_name'] = df['file_name'].apply(lambda x: x + '.png')

    df = df[['file_name', 'text']]
    return df

def process_last_column(input_text: str) -> str:
    return input_text.replace('|', ' ')

The error I got is :

Traceback (most recent call last):
  File "train.py", line 205, in <module>
    main()
  File "train.py", line 146, in main
    df = load_data()
  File "train.py", line 108, in load_iam
    file_name, _, _, _, _, _, _, _, text = line.strip().split(" ")
ValueError: not enough values to unpack (expected 9, got 1)

CodePudding user response：

You can combine unpacking with * to take the first and last fields of each line and ignore everything in between. Let the content of file.txt be

a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from
a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers
a01-000u-02 ok 157 16 408 1106 1986 105 is|to|be|made|at|a|meeting|of|Labour

then

with open("file.txt","r") as f:
    for line in f:
        x, *_, y = line.strip().split()
        print(x,y.replace("|"," "))

gives output

a01-000u-00 A MOVE to stop Mr. Gaitskell from
a01-000u-01 nominating any more Labour life Peers
a01-000u-02 is to be made at a meeting of Labour

Note: for simplicity's sake I print to standard output and do not append the .png suffix.
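
To go from printed lines to the data frame and text output the question asks for, the same star-unpacking can feed pandas. This is only a minimal sketch: the names parse_lines, to_text and SAMPLE_LINES are made up for illustration, and the length check assumes that the lines which break unpacking are blank or contain a single token, which the question does not confirm.

```python
import pandas as pd

# Sample data copied from the question, plus one blank line (an assumption)
# to show what the guard below does.
SAMPLE_LINES = [
    "a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from\n",
    "\n",
    "a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers\n",
]


def parse_lines(lines) -> pd.DataFrame:
    """Build a (file_name, text) frame from an iterable of raw lines."""
    rows = []
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) < 2:
            # Blank or single-token line: star-unpacking would raise here.
            continue
        file_name, *_, text = fields
        rows.append((file_name + ".png", text.replace("|", " ")))
    return pd.DataFrame(rows, columns=["file_name", "text"])


def to_text(df: pd.DataFrame) -> str:
    """Render one 'file_name text' line per row."""
    pairs = zip(df["file_name"], df["text"])
    return "".join(f"{name} {text}\n" for name, text in pairs)


df = parse_lines(SAMPLE_LINES)
print(to_text(df), end="")
```

With a real file, an open file object can be passed to parse_lines in place of the list, and the string returned by to_text can be written to expected_out.txt. Writing the lines by hand avoids the quoting that DataFrame.to_csv applies when a field contains the separator, which would happen here with a space separator.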

CodePudding user response：

...
            split = line.strip().split(' ')
            file_name = split[0]
            text = split[-1]
            data.append((file_name, process_last_column(text)))
...

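This will get you past the ValueError, because indexing the first and last elements works however many fields a line has. Be aware that a blank line no longer raises but still appends a row of empty strings.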
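
Since "expected 9, got 1" means at least one line split into a single piece, it may help to find those lines before deciding how to handle them. A minimal sketch, assuming the file is read line by line as in the question; find_short_lines and the sample list are hypothetical names and data, not part of the original script.

```python
def find_short_lines(lines, expected_fields=9):
    """Return (line_number, raw_line) pairs for lines with too few fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        # Same split as in the question, so the counts match the traceback.
        fields = line.strip().split(' ')
        if len(fields) < expected_fields:
            problems.append((number, line.rstrip('\n')))
    return problems


# Two lines from the question with a blank line between them (an assumption).
sample = [
    "a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from\n",
    "\n",
    "a01-000u-02 ok 157 16 408 1106 1986 105 is|to|be|made|at|a|meeting|of|Labour\n",
]
short_lines = find_short_lines(sample)
print(short_lines)
```

An empty string split on ' ' gives a list with one empty element, so a blank line is reported as having one field, the same count as in the traceback.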