I have an input text file in which each line contains an image name, some metadata I don't need, and a last column of tokens separated by the | character. Example input (input.txt):
a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from
a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers
a01-000u-02 ok 157 16 408 1106 1986 105 is|to|be|made|at|a|meeting|of|Labour
The expected output is a file (expected_out.txt) or a DataFrame that contains only the image name and the text:
a01-000u-00.png A MOVE to stop Mr. Gaitskell from
a01-000u-01.png nominating any more Labour life Peers
a01-000u-02.png is to be made at a meeting of Labour
The script I use to process the file is below:

import pandas as pd

train_text = 'input.txt'

def load_data() -> pd.DataFrame:
    data = []
    with open(train_text) as infile:
        for line in infile:
            file_name, _, _, _, _, _, _, _, text = line.strip().split(' ')
            data.append((file_name, process_last_column(text)))
    df = pd.DataFrame(data, columns=['file_name', 'text'])
    df.rename(columns={0: 'file_name', 8: 'text'}, inplace=True)
    df['file_name'] = df['file_name'].apply(lambda x: x + '.png')
    df = df[['file_name', 'text']]
    return df

def process_last_column(input_text: str) -> str:
    return input_text.replace('|', ' ')
The error I got is:
Traceback (most recent call last):
  File "train.py", line 205, in <module>
    main()
  File "train.py", line 146, in main
    df = load_data()
  File "train.py", line 108, in load_iam
    file_name, _, _, _, _, _, _, _, text = line.strip().split(" ")
ValueError: not enough values to unpack (expected 9, got 1)
CodePudding user response:
You can combine unpacking with a starred target (*_), which absorbs however many middle columns there are. Suppose file.txt contains
a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from
a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers
a01-000u-02 ok 157 16 408 1106 1986 105 is|to|be|made|at|a|meeting|of|Labour
then
with open("file.txt", "r") as f:
    for line in f:
        x, *_, y = line.strip().split()
        print(x, y.replace("|", " "))
gives output
a01-000u-00 A MOVE to stop Mr. Gaitskell from
a01-000u-01 nominating any more Labour life Peers
a01-000u-02 is to be made at a meeting of Labour
Note: for simplicity's sake, I print to standard output instead of building a DataFrame.
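The starred unpacking above could be folded back into the question's load_data roughly as follows. This is only a sketch under a few assumptions: the function takes an already-open file object rather than a path, and lines with fewer than two fields (for example blank lines) are silently skipped, which is a choice made here for illustration and not something the question specifies.

```python
import io

import pandas as pd


def load_data(infile) -> pd.DataFrame:
    """Build a (file_name, text) frame from an open text file."""
    rows = []
    for line in infile:
        fields = line.split()  # no argument: split on any run of whitespace
        if len(fields) < 2:
            # Assumption: blank or malformed lines are skipped, not reported.
            continue
        file_name, *_, text = fields  # first field, ignored middle, last field
        rows.append((file_name + ".png", text.replace("|", " ")))
    return pd.DataFrame(rows, columns=["file_name", "text"])


# In-memory stand-in for input.txt, with a blank line in the middle.
sample = io.StringIO(
    "a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from\n"
    "\n"
    "a01-000u-01 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers\n"
)
df = load_data(sample)
print(df.to_string(index=False))
```

With a real file, the same function would be called as load_data(open("input.txt")) inside a with block.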
CodePudding user response:
...
split = line.strip().split(' ')
file_name = split[0]
text = split[-1]
data.append((file_name, process_last_column(text)))
...
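This will get you past the ValueError. Note that a blank line still produces a row with an empty file name and empty text, so you may want to skip such lines.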
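The "got 1" in the traceback suggests that at least one line yields a single field when split on a space. Two plausible causes, neither confirmed by the question, are a blank line and a line separated by tabs instead of spaces. The sketch below reproduces both and adds a hypothetical helper, find_short_lines, that reports which lines of a file have too few fields; the sample lines are made up for illustration.

```python
def naive_fields(line):
    """Split exactly as the question's code does: on single spaces only."""
    return line.strip().split(" ")


def find_short_lines(lines, expected=9):
    """Return (line_number, field_count) for lines with fewer than `expected` fields."""
    short = []
    for number, line in enumerate(lines, start=1):
        count = len(line.split())  # whitespace-agnostic field count
        if count < expected:
            short.append((number, count))
    return short


# A blank line and a tab-separated line both collapse to a single field.
print(naive_fields("\n"))
print(naive_fields("a01-000u-00\tok\t154\n"))

# Hypothetical sample: one good line, one blank line, one header-style line.
sample = [
    "a01-000u-00 ok 154 19 408 746 1661 89 A|MOVE|to|stop\n",
    "\n",
    "# a comment line\n",
]
print(find_short_lines(sample))
```

Running find_short_lines over the lines of the real input file should point at the line that triggers the error.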