Split a big text file into multiple smaller ones based on a regex pattern

Time:08-18

I have a large text file looking like:

....
sdsdsd
..........

asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

......
ddss
................

asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

.....
xxxx
.......
asdfghjkl

I want to split the text file into multiple smaller text files and save them as .txt on my system, splitting at each occurrence of ..... [a line of multiple periods]. The results should be saved like this:

group1_sdsdsd.txt

....
sdsdsd
..........

asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

group1_ddss.txt

ddss
................

asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

and

group1_xxxx.txt

.....
xxxx
.......

asdfghjkl

I figured this could be done with a regex along the lines of the following:

txt = re.sub(r'(([^\w\s])\2+)', r' ', txt).strip()  # for punctuation characters repeated 2 or more times

but I have not been able to work out the complete solution.

The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt. Here group1 is the identifier for the specific big text file: I have multiple big text files, need to do the same on all of them, and want to know which big file each piece came from.

CodePudding user response:

If the separators are lines that consist only of multiple dots, and you want to get each part separately, you might use a pattern like:

^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*

Explanation

  • ^ Start of a line (the pattern is used with re.MULTILINE)
  • \.{3,}\n Match 3 or more dots and a newline
  • (\S+)\n Capture 1 or more non-whitespace chars in group 1 for the filename and match a newline
  • \.{3,} Match 3 or more dots
  • (?: Non capture group to repeat as a whole part
    • \n Match a newline
    • (?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
    • .* Match the whole line
  • )* Close the non capture group and optionally repeat it
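
To see what the pattern captures, here is a minimal sketch that runs it over a shortened sample laid out like the file in the question. The `sample` string and the names `PATTERN` and `names` are illustrative assumptions, not part of the original data.

```python
import re

# Same pattern as above; re.MULTILINE makes ^ match at the start of each line.
PATTERN = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)

# Shortened, made-up sample in the same layout as the question's file.
sample = (
    "....\n"
    "sdsdsd\n"
    "..........\n"
    "\n"
    "first body line.\n"
    "\n"
    "......\n"
    "ddss\n"
    "................\n"
    "\n"
    "second body line.\n"
)

# group(1) is the name between the two dotted lines of each block.
names = [m.group(1) for m in PATTERN.finditer(sample)]
print(names)
```

Each match covers one dotted header plus all following lines up to, but not including, the next dotted header.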

Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.


Example code

import re

pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"

s = ("....your data here")

your_path = "/your/path/"

# re.MULTILINE lets ^ match at the start of every line, not only at the start of the string
for match in re.finditer(pattern, s, re.MULTILINE):
    # group 1 is the name between the two dotted lines, e.g. "sdsdsd"
    filename = your_path + "group1_{}.txt".format(match.group(1))
    with open(filename, 'w') as f:
        f.write(match.group())
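
Since the question mentions several big files, the same loop could be wrapped in a function that takes the prefix from the big file's own name. This is only a sketch under assumptions: the function name `split_file`, the UTF-8 encoding, and the idea that a file called group1.txt should yield the prefix "group1" are all hypothetical. The captured name is used as-is, so a name containing a character such as "/" would need sanitizing first.

```python
import re
from pathlib import Path

HEADER_BLOCK = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)


def split_file(big_file, out_dir):
    """Write one .txt file per matched block and return the paths written."""
    big_file = Path(big_file)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumed naming scheme: "group1.txt" gives the prefix "group1".
    prefix = big_file.stem
    text = big_file.read_text(encoding="utf-8")

    written = []
    for match in HEADER_BLOCK.finditer(text):
        # group(1) is the name between the two dotted lines.
        target = out_dir / f"{prefix}_{match.group(1)}.txt"
        target.write_text(match.group(), encoding="utf-8")
        written.append(target)
    return written


# Hypothetical usage over every big file in a folder:
# for big in Path("/your/path").glob("group*.txt"):
#     split_file(big, "/your/path/parts")
```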