Split a big text file into multiple smaller ones based on a regex

I have a large text file looking like:

....
sdsdsd
..........

asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

......
ddss
................

asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

.....
xxxx
.......
asdfghjkl

I want to split the text file into multiple small text files, splitting at each occurrence of ..... [a line of multiple periods], and save them as .txt files on my system, like this:

group1_sdsdsd.txt

....
sdsdsd
..........

asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

group1_ddss.txt

ddss
................

asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

and

group1_xxxx.txt

.....
xxxx
.......

asdfghjkl

I have figured out that this can probably be done with a regex along the following lines

txt = re.sub(r'(([^\w\s])\2+)', r' ', txt).strip()  # punctuation characters repeated 2 or more times

but I have not been able to work it out completely.

The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt. Here group1 is an identifier for the specific big text file: I have multiple big text files, need to do the same on all of them, and want to know which big file each piece came from.

CodePudding user response：

If you want to match only the lines that consist entirely of multiple dots, and get the separate parts between them, you might use a pattern like:

^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*

Explanation

^ Start of a line (the pattern is used with re.MULTILINE)
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1 or more non-whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
- \n Match a newline
- (?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
- .* Match the rest of the line
)* Close the non capture group and optionally repeat it
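
The breakdown above can also be written directly into the pattern with re.VERBOSE, so each part carries its own comment. This is only a sketch: it assumes the name line is `\S+` (one or more non-whitespace characters), and `PATTERN`, `sample` and `names` are made-up names for illustration, with a small sample in the same shape as the question's file.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE (assumes \S+ for the name line).
PATTERN = re.compile(
    r"""
    ^\.{3,}\n          # a line of 3 or more dots
    (\S+)\n            # group 1: the name used for the file
    \.{3,}             # another line of 3 or more dots
    (?:                # then, repeated for every following line:
        \n             #   a newline
        (?!\.{3,}\n\S+\n\.{3,})   #   not the start of the next header
        .*             #   the rest of that line
    )*
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample: two dotted headers, each followed by some body text.
sample = "....\nfirst\n.....\n\nbody one.\n\n......\nsecond\n....\nbody two."

names = [m.group(1) for m in PATTERN.finditer(sample)]
print(names)  # ['first', 'second']
```

Each match runs from one dotted header up to, but not including, the next one, which is what the negative lookahead is for.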

Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.

Example code

import re

pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"

s = ("....your data here")

matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"

# Write each matched part to its own .txt file, named after group 1
for match in matches:
    with open(your_path + "group1_{}.txt".format(match.group(1)), 'w') as f:
        f.write(match.group())
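
Since the question mentions several big files, each needing its own identifier, the loop above could be wrapped in a function that derives the prefix from the input file's name. A minimal sketch, assuming the prefix is the file stem (so `group1.txt` gives `group1_...`) and that the name line is `\S+`; `split_file`, `HEADER_BLOCK` and the arguments are hypothetical names, not part of the original answer.

```python
import re
from pathlib import Path

# Assumes \S+ (one or more non-whitespace characters) for the name line.
HEADER_BLOCK = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)


def split_file(big_file, out_dir):
    """Write one <stem>_<name>.txt per dotted block; return the paths written."""
    big_file, out_dir = Path(big_file), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text = big_file.read_text(encoding="utf-8")
    written = []
    for match in HEADER_BLOCK.finditer(text):
        # big_file.stem is the identifier, e.g. "group1" for group1.txt
        target = out_dir / f"{big_file.stem}_{match.group(1)}.txt"
        target.write_text(match.group(), encoding="utf-8")
        written.append(target)
    return written
```

Calling `split_file` once per big file (for example in a loop over `Path(...).glob("*.txt")`) keeps the pieces from different sources apart. Names that contain path separators or other characters not allowed in filenames would need sanitising first, which this sketch does not do.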