How to split a file using a string as an identifier in Python?

Time:11-22

I have a huge text file that I need to split into several files. The text file contains an identifier that marks where each split should happen. Here is what part of the text file looks like:

Comp MOFVersion 10.1
Copyright 1997-2006. All rights reserved.
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...


-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....


-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

Done

I expect the file to be split wherever the string "Starting The Process" appears. So for a text file like the example above, the result would be 3 files, each with different content. For example:

file1
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...


file2
-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....

file3
-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

Done

Is it possible to do it in Python? Thank you for any advice.

CodePudding user response:

If the file is small enough to fit comfortably in memory (say, 1 GB or less), you could read the entire file into a string and then use re.findall:

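import re

with open('data.txt', 'r') as file:
    data = file.read()

# Each match runs from the dash line above a timestamp up to the next dash line (or end of file).
parts = re.findall(r'-{10,}[^-]*\n\w{3} \d{2}\/\d{2}\/\d{4}.*?-{10,}.*?(?=-{10,}|$)', data, flags=re.S)

# Write each matched section to its own numbered file.
for cnt, part in enumerate(parts, start=1):
    with open('file ' + str(cnt), 'w') as output:
        output.write(part)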
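
Since the question mentions a huge file, here is a minimal sketch of a line-by-line variant that never loads the whole file into memory. It is not part of the answer above. The function name split_log, the prefix parameter, and the output file names are hypothetical, and it assumes that the dash line immediately precedes each "Starting The Process" line, as in the sample. Anything before the first marker (the file header) is dropped, matching the expected output in the question.

```python
from pathlib import Path

MARKER = "Starting The Process"  # identifier string taken from the question


def split_log(src, out_dir, prefix="file"):
    """Stream src line by line and start a new output file at each marker line.

    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None      # currently open output file; None until the first marker
    pending = None  # previous line, held back because it may be the dash
                    # line that belongs to the next section
    try:
        with open(src, "r") as f:
            for line in f:
                if MARKER in line:
                    # A new section begins: switch to a new output file.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{prefix}{len(written) + 1}.txt"
                    out = open(path, "w")
                    written.append(path)
                # Flush the held-back line into whichever file is now current.
                # Lines seen before the first marker are discarded.
                if pending is not None and out is not None:
                    out.write(pending)
                pending = line
        # Write the very last line of the input.
        if pending is not None and out is not None:
            out.write(pending)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_log('data.txt', 'out') would write out/file1.txt, out/file2.txt, and so on. Holding back one line is what lets the dash line above each timestamp land in the new file rather than at the end of the previous one.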

CodePudding user response:

An alternative solution if the dashes in the file are of fixed length could be:

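sep = '--------------------------------------------------'

with open('file.txt', 'r') as f:
    # Drop whatever precedes the first dash line (the file header, or '' if there is none).
    split_text = f.read().split(sep)[1:]

# The elements now alternate: timestamp, content, timestamp, content, ...
for i in range(0, len(split_text) - 1, 2):
    # i is 0, 2, 4, ... so the files are numbered 1, 2, 3, ...
    with open(f'file{i // 2 + 1}.txt', 'w') as temp:
        temp_txt = ''.join(split_text[i:i + 2])
        temp.write(temp_txt)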

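Essentially, this splits the text on the dash lines and joins each consecutive pair of elements (a timestamp followed by its content). This way each output file keeps its timestamp together with its content. Note that the dash lines themselves are consumed by split() and do not appear in the output files.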
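
If the dash lines are not guaranteed to have a fixed length, or should be kept in the output, one possible variation is to split with a zero-width regular expression instead. The sketch below is an illustration rather than part of the answer above. The names SECTION_START and split_sections are hypothetical, and it assumes each section starts with a line of at least 10 dashes directly followed by a line containing "Starting The Process". Splitting on an empty match requires Python 3.7 or later.

```python
import re

# Zero-width pattern: matches at the start of a dash line whose next line
# contains the marker string. Nothing is consumed, so the dash lines stay
# in the resulting sections.
SECTION_START = re.compile(
    r"^(?=-{10,}[ \t]*\n[^\n]*Starting The Process)",
    flags=re.MULTILINE,
)


def split_sections(text):
    """Return the sections of text, each beginning with its dash line."""
    parts = SECTION_START.split(text)
    # parts[0] is whatever precedes the first section (the file header,
    # or '' when the text starts with a section), so it is dropped.
    return parts[1:]
```

Each returned string can then be written to its own file, for example by looping with enumerate(split_sections(data), start=1). The dash line below the timestamp does not trigger a split because the line after it does not contain the marker.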
