Merge two texts without duplicate text blocks, maintaining the internal order

Time:09-27

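How can I create a text c that is the union of two texts, a and b, where each text is made of blocks describing geopolitical places? Each block starts with a header line whose first element is the name of the place (for example **USA**) and continues with its textual/numerical description lines. A block that appears in both texts should appear only once in c, and the order of the lines inside each block must be kept.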

text a

   **USA**                                                    1776
   descript 1 numeric ref.                 cod.0211-0154858-5787-1
   descript 2 numeric ref.                 cod.0211-0154858-5787-1
   descript 3 numeric ref.                 cod.0211-0154858-5787-1
   **Europe**                                                 1948
   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   **Japan**                                                  1947                                            
   descript 1 numeric ref.               cod.0154858-8788-6885-sd1
   descript 2 numeric ref.               cod.0154858-8788-6885-sd1
   descript 3 numeric ref.               cod.0154858-8788-6885-sd1

text b

 **UK**                                                        1776
   descript 1 numeric ref.                  cod.0211-0154858-5787-1
   descript 2 numeric ref.                  cod.0211-0154858-5787-1
   descript 3 numeric ref.                  cod.0211-0154858-5787-1
   **NETHERLANDS**                                             1948
   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   **MEXICO**                                                 1947                                            
   descript 1 numeric ref.               cod.0154858-8788-6885-sd18
   descript 2 numeric ref.               cod.0154858-8788-6885-sd18
   descript 3 numeric ref.               cod.0154858-8788-6885-sd18
   **Europe**                                                 1948
   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   **Japan**                                                  1947                                            
   descript 1 numeric ref.               cod.0154858-8788-6885-sd1
   descript 2 numeric ref.               cod.0154858-8788-6885-sd1
   descript 3 numeric ref.               cod.0154858-8788-6885-sd1

hypothetical text c as result

   **USA**                                                    1776
   descript 1 numeric ref.                 cod.0211-0154858-5787-1
   descript 2 numeric ref.                 cod.0211-0154858-5787-1
   descript 3 numeric ref.                 cod.0211-0154858-5787-1
   **Europe**                                                 1948
   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   **Japan**                                                  1947                                            
   descript 1 numeric ref.               cod.0154858-8788-6885-sd1
   descript 2 numeric ref.               cod.0154858-8788-6885-sd1
   descript 3 numeric ref.               cod.0154858-8788-6885-sd1

   **UK**                                                      1776
   descript 1 numeric ref.                  cod.0211-0154858-5787-1
   descript 2 numeric ref.                  cod.0211-0154858-5787-1
   descript 3 numeric ref.                  cod.0211-0154858-5787-1
   **NETHERLANDS**                                             1948
   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   **MEXICO**                                                 1947                                            
   descript 1 numeric ref.               cod.0154858-8788-6885-sd18
   descript 2 numeric ref.               cod.0154858-8788-6885-sd18
   descript 3 numeric ref.               cod.0154858-8788-6885-sd18

I ran this:

import os, glob
files = glob.glob('*.txt')

all_lines = []
for f in files:
    with open(f,'r') as fi:
        all_lines += fi.readlines()
all_lines = set(all_lines)

with open('combinedfile.txt','w') as fo:
    fo.write("\n".join(all_lines))
        

It seems to merge the way I want, but it doesn't respect any order and it inserts blank lines. Any suggestions?

I attach the result here:

   descript 2 numeric ref.                  cod.0211-0154858-5787-1

   **NETHERLANDS**                                             1948

   descript 1 numeric ref.                  cod.0211-0154858-5787-1

   descript 3 numeric ref.                 cod.0211-0154858-5787-1

   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12

   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1

   descript 1 numeric ref.               cod.0154858-8788-6885-sd1

   descript 3 numeric ref.               cod.0154858-8788-6885-sd1

   descript 1 numeric ref.                 cod.0211-0154858-5787-1

   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1

   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1

   **Japan**                                                  1947                                            

**USA**                                                    1776

   descript 3 numeric ref.               cod.0154858-8788-6885-sd18



   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12

   descript 2 numeric ref.               cod.0154858-8788-6885-sd1

   **MEXICO**                                                 1947                                            

   descript 2 numeric ref.               cod.0154858-8788-6885-sd18

   descript 1 numeric ref.               cod.0154858-8788-6885-sd18

   descript 2 numeric ref.                 cod.0211-0154858-5787-1

   **Europe**                                                 1948

   descript 3 numeric ref.                  cod.0211-0154858-5787-1

   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12

 **UK**                                                        1776

Thank you

CodePudding user response:

Does this answer your question?

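Each line returned by readlines() already ends with '\n', so joining the lines with "\n" added a second line break after each element. That is where the blank lines came from.

Creating a set does not preserve order, which is why the lines came out shuffled.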

import glob

files = glob.glob('*.txt')

all_lines = []
for f in files:
    if f == 'combinedfile.txt':  # ignore the destination file if it exists
        continue
    with open(f, 'r') as fi:
        all_lines += fi.readlines()

all_lines = list(dict.fromkeys(all_lines))  # ignore duplicates --> set() isn't ordered

with open('combinedfile.txt', 'w') as fo:
    fo.writelines(all_lines)  # write all lines in one
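
The two points above can be checked on a small in-memory sample. This is only an illustrative sketch: the sample lines and the helper name `dedupe_lines` are made up for the example and do not come from the original files.

```python
def dedupe_lines(lines):
    """Drop repeated lines while keeping first-seen order (hypothetical helper)."""
    # dict keys keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(lines))


# Made-up sample, shaped like what readlines() returns:
# every element already ends with '\n'.
lines = ["**USA** 1776\n", "descript 1\n", "**USA** 1776\n", "descript 2\n"]

unique = dedupe_lines(lines)

# Joining with '\n' puts a second newline after each line -> blank lines.
doubled = "\n".join(unique)
# Joining with '' keeps exactly one newline per line, like writelines().
clean = "".join(unique)

print(repr(doubled))
print(repr(clean))
```

Keep in mind that this removes duplicates line by line, not block by block: if two different blocks happened to contain an identical line, the second occurrence would be dropped as well.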

CodePudding user response:

Try this:

def changing_new_file(text_block, new_file):
    # append the block only if it is non-empty and not already collected
    if text_block and text_block not in new_file:
        new_file += [text_block]
    return new_file


def reading_files(files):
    new_file = []  # list of blocks; each block is a list of lines
    for file in files:
        with open(file, 'r') as target_file:
            text_block = []
            for line in target_file:
                if '**' in line:
                    # a header line closes the previous block and starts a new one
                    new_file = changing_new_file(text_block, new_file)
                    text_block = [line]
                else:
                    text_block += [line]
            new_file = changing_new_file(text_block, new_file)  # last block of the file
            new_file += [['\n']]  # separator: a lone newline after each file
    return new_file


def creating_new_file(new_file):
    with open('combinedfile.txt', 'w') as fo:
        for text_block in new_file:
            # lines already end with '\n', so this join leaves a blank line between them
            fo.write("\n".join(text_block))


if __name__ == '__main__':
    files = ['text_a.txt', 'text_b.txt']
    new_file = reading_files(files)
    creating_new_file(new_file)

The result is:

   **USA**                                                    1776

   descript 1 numeric ref.                 cod.0211-0154858-5787-1

   descript 2 numeric ref.                 cod.0211-0154858-5787-1

   descript 3 numeric ref.                 cod.0211-0154858-5787-1
   **Europe**                                                 1948

   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1

   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1

   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-1
   **Japan**                                                  1947                                            

   descript 1 numeric ref.               cod.0154858-8788-6885-sd1

   descript 2 numeric ref.               cod.0154858-8788-6885-sd1

   descript 3 numeric ref.               cod.0154858-8788-6885-sd1
   **UK**                                                        1776

   descript 1 numeric ref.                  cod.0211-0154858-5787-1

   descript 2 numeric ref.                  cod.0211-0154858-5787-1

   descript 3 numeric ref.                  cod.0211-0154858-5787-1
   **NETHERLANDS**                                             1948

   descript 1 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12

   descript 2 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12

   descript 3 numeric ref. cod.5485-22218872e-1-2889978dd-sd11s5-12
   **MEXICO**                                                 1947                                            

   descript 1 numeric ref.               cod.0154858-8788-6885-sd18

   descript 2 numeric ref.               cod.0154858-8788-6885-sd18

   descript 3 numeric ref.               cod.0154858-8788-6885-sd18

This algorithm divides the text into blocks (a new block starts at every line containing '**') and compares whole blocks, so a block that has already been collected is not added again.
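
The same block-by-block idea can be sketched on in-memory lists, writing the lines back without the extra blank lines. This is a minimal sketch under stated assumptions, not a replacement for the answer above: the names `split_blocks` and `merge_blocks`, the shortened sample data, and the choice to ignore leading/trailing whitespace when comparing blocks are all assumptions made for the illustration.

```python
def split_blocks(lines, marker="**"):
    """Group lines into blocks; a line containing `marker` starts a new block."""
    blocks = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        if marker in line or not blocks:
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return blocks


def merge_blocks(*texts):
    """Collect blocks from several line lists, keeping first-seen order."""
    seen = set()
    merged = []
    for lines in texts:
        for block in split_blocks(lines):
            # Assumption: blocks are equal if their lines match after
            # stripping leading/trailing whitespace.
            key = tuple(line.strip() for line in block)
            if key not in seen:
                seen.add(key)
                merged.append(block)
    return merged


# Made-up, shortened sample data (every line ends with '\n').
text_a = ["**USA** 1776\n", "descript 1 cod.1\n",
          "**Europe** 1948\n", "descript 1 cod.2\n"]
text_b = ["**UK** 1776\n", "descript 1 cod.3\n",
          "**Europe** 1948\n", "descript 1 cod.2\n"]

merged = merge_blocks(text_a, text_b)
# Lines already carry their own '\n', so they are concatenated as-is.
output = "".join(line for block in merged for line in block)
print(output)
```

Tracking the blocks already seen in a set keeps the duplicate check cheap even with many blocks, while the `merged` list preserves the order in which blocks were first encountered.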
