How can I create a text c that is the union of the blocks found in text a and text b? Each block describes a geopolitical place: its first line starts with the name of the place, and the lines that follow are the textual/numerical descriptions belonging to that place. A block that appears in both texts should appear only once in the result.
text a
**USA** 1776
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
**Europe** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
**Japan** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd1
descript 2 numeric ref. cod.0154858-8788-6885-sd1
descript 3 numeric ref. cod.0154858-8788-6885-sd1
text b
**UK** 1776
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
**NETHERLANDS** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
**MEXICO** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd18
descript 2 numeric ref. cod.0154858-8788-6885-sd18
descript 3 numeric ref. cod.0154858-8788-6885-sd18
**Europe** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
**Japan** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd1
descript 2 numeric ref. cod.0154858-8788-6885-sd1
descript 3 numeric ref. cod.0154858-8788-6885-sd1
hypothetical text c as result
**USA** 1776
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
**Europe** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
**Japan** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd1
descript 2 numeric ref. cod.0154858-8788-6885-sd1
descript 3 numeric ref. cod.0154858-8788-6885-sd1
**UK** 1776
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
**NETHERLANDS** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
**MEXICO** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd18
descript 2 numeric ref. cod.0154858-8788-6885-sd18
descript 3 numeric ref. cod.0154858-8788-6885-sd18
I run this:
    import os, glob

    files = glob.glob('*.txt')
    all_lines = []
    for f in files:
        with open(f,'r') as fi:
            all_lines += fi.readlines()
    all_lines = set(all_lines)
    with open('combinedfile.txt','w') as fo:
        fo.write("\n".join(all_lines))
It seems to merge the way I would like, but it doesn't preserve any order and it inserts blank lines. Any suggestions?
I attach the result here:
descript 2 numeric ref. cod.0211-0154858-5787-1
**NETHERLANDS** 1948
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 1 numeric ref. cod.0154858-8788-6885-sd1
descript 3 numeric ref. cod.0154858-8788-6885-sd1
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
**Japan** 1947
**USA** 1776
descript 3 numeric ref. cod.0154858-8788-6885-sd18
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 2 numeric ref. cod.0154858-8788-6885-sd1
**MEXICO** 1947
descript 2 numeric ref. cod.0154858-8788-6885-sd18
descript 1 numeric ref. cod.0154858-8788-6885-sd18
descript 2 numeric ref. cod.0211-0154858-5787-1
**Europe** 1948
descript 3 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
**UK** 1776
Thank you
CodePudding user response:
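Does this answer your question?
Each line returned by readlines() already ends with '\n', so joining the lines with "\n" adds a second line break after every line. That is where the blank lines come from.
A set does not keep insertion order, which is why the lines came out shuffled.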
    import glob

    files = glob.glob('*.txt')
    all_lines = []
    for f in files:
        if f == 'combinedfile.txt':  # ignore the destination file if it exists
            continue
        with open(f, 'r') as fi:
            all_lines += fi.readlines()  # accumulate the lines of every file

    # drop duplicate lines; unlike set(), dict keys keep insertion order
    all_lines = list(dict.fromkeys(all_lines))
    with open('combinedfile.txt', 'w') as fo:
        fo.writelines(all_lines)  # lines already end with '\n'
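The two effects described in this answer can be reproduced on a small in-memory list, without touching any files. This is only an illustrative sketch: the variable names `lines`, `doubled` and `unique_lines` are made up for the example, and the sample lines are shortened from the question.

```python
# Lines as readlines() would return them: each one keeps its trailing '\n'.
lines = [
    "**USA** 1776\n",
    "descript 1 numeric ref. cod.0211-0154858-5787-1\n",
    "**UK** 1776\n",
    "descript 1 numeric ref. cod.0211-0154858-5787-1\n",  # same text as under USA
]

# Joining with "\n" puts a second line break after each line,
# which shows up as blank lines in the output file.
doubled = "\n".join(lines)

# dict.fromkeys keeps the first occurrence of each line in its original
# position (dict order is guaranteed since Python 3.7); set() gives no order.
unique_lines = list(dict.fromkeys(lines))

print(repr(doubled))
print(unique_lines)
```

Note that deduplicating line by line also removes a description line that repeats under a different header, as happens with the UK and USA blocks in the sample data, so it may not be what is wanted when whole blocks should be compared.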
CodePudding user response:
Try this:
    def changing_new_file(text_block, new_file):
        # append the block only if it is non-empty and not already stored
        if text_block and text_block not in new_file:
            new_file += [text_block]
        return new_file

    def reading_files(files):
        new_file = []
        for file in files:
            with open(file, 'r') as target_file:
                text_block = []
                for line in target_file:
                    if '**' in line:
                        # a header line closes the previous block and opens a new one
                        new_file = changing_new_file(text_block, new_file)
                        text_block = [line]
                    else:
                        text_block += [line]
                # store the last block of the file
                new_file = changing_new_file(text_block, new_file)
                # separate files in case the last line has no trailing newline
                new_file += [['\n']]
        return new_file

    def creating_new_file(new_file):
        with open('combinedfile.txt', 'w') as fo:
            for text_block in new_file:
                # lines already end with '\n', so join without a separator
                fo.write("".join(text_block))

    if __name__ == '__main__':
        files = ['text_a.txt', 'text_b.txt']
        new_file = reading_files(files)
        creating_new_file(new_file)
The result is:
**USA** 1776
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
**Europe** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
**Japan** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd1
descript 2 numeric ref. cod.0154858-8788-6885-sd1
descript 3 numeric ref. cod.0154858-8788-6885-sd1
**UK** 1776
descript 1 numeric ref. cod.0211-0154858-5787-1
descript 2 numeric ref. cod.0211-0154858-5787-1
descript 3 numeric ref. cod.0211-0154858-5787-1
**NETHERLANDS** 1948
descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
**MEXICO** 1947
descript 1 numeric ref. cod.0154858-8788-6885-sd18
descript 2 numeric ref. cod.0154858-8788-6885-sd18
descript 3 numeric ref. cod.0154858-8788-6885-sd18
This algorithm splits each text into blocks, one per '**' header line, and compares whole blocks: a block is added to the output only if an identical block is not already there.
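The same block-by-block idea can also be sketched on plain strings, which makes it easy to try out without files. This is a minimal variant under stated assumptions, not the code from the answer above: the function names `split_blocks` and `merge_texts` are hypothetical, a header is assumed to be any line that starts with `**`, and trailing whitespace and blank lines are dropped so that a block at the very end of a file (with no final newline) still compares equal to the same block elsewhere.

```python
def split_blocks(text):
    """Split text into blocks; a new block starts at each line beginning with '**'."""
    blocks, current = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("**") and current:
            blocks.append(tuple(current))  # close the previous block
            current = []
        current.append(line.rstrip())
    if current:
        blocks.append(tuple(current))  # close the last block
    return blocks


def merge_texts(*texts):
    """Concatenate the blocks of all texts, keeping only the first copy of each block."""
    seen = {}  # dict keys keep insertion order, so first-seen order is preserved
    for text in texts:
        for block in split_blocks(text):
            seen.setdefault(block, None)
    return "".join(line + "\n" for block in seen for line in block)


# Shortened sample: the Japan block appears in both texts, and text_b
# has no trailing newline after its last line.
text_a = "**USA** 1776\ndescript 1\n**Japan** 1947\ndescript 1\n"
text_b = "**UK** 1776\ndescript 1\n**Japan** 1947\ndescript 1"
text_c = merge_texts(text_a, text_b)
print(text_c)
```

Because whole blocks are compared rather than single lines, the UK block keeps its description line even though the same line already appears under USA, while the repeated Japan block is written only once.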