Python regex: removing empty lines left by re.sub

Hi, I'm working on a script to speed up my work, but I have a small problem: the output contains blank lines.

.Devi  
       
.Dave  
       
.Liana 
.Ricky 
.Oswyne

.Devi
.Putra
.Kelvin

.Gilang

.Delvin

This is my source:

import re
filex = input('Input file name : ')
with open(filex, "r") as file:
    st3 = file.read()
pattern = r"\bOperational\.[0-9]{1,4}|\bManagement\.[0-9]{1,4}|\bAdmin\.[0-9]{1,4}|\bStaff.{1,25}"
mod_string = re.sub(pattern, '', st3)
print(mod_string)
with open("clean.txt", "w") as out:
    out.write(mod_string)

This is the list I want to filter:

position | Staff Number | Name
Operational.1252.Devi
Staff.1875.Erin
Operational.1552.Dave
Staff.1875.Hutri
Operational.1952.Liana
Management.1292.Ricky
Staff.1875.Udin
Management.1852.Oswyne
Staff.1875.Udin
Operational.1052.Devi
Management.1282.Putra
Operational.1262.Kelvin
Admin.9823.Gilang
Staff.1275.Siska
Staff.1835.Udin
Admin.9823.Gilang
Staff.1875.Silalahi
Management.1282.Delvin
and more List....

I want the output to look like this, without blank lines and without duplicate lines:

.Devi
.Dave
.Liana
.Ricky
.Oswyne
.Devi
.Putra
.Kelvin
.Gilang
.Delvin

CodePudding user response：

Using your data, I wrote another script that does not use regex; it uses readlines and split instead.

The idea is to read the file line by line with readlines, split each line into 3 parts using . as the separator, and take the last part (the name).

If you need to use regex, you can ignore this answer.

# read every line of the input file
with open('test.txt', 'r') as file1:
    lines = file1.readlines()

# ignore the first line (the header)
lines = lines[1:]

with open('output.txt', 'w') as output_file:
    for line in lines:
        # split the line using . as separator and take the third part (the name)
        output_file.write(line.split('.')[2])

The output will be:

Devi
Erin
Dave
Hutri
Liana
Ricky
Udin
Oswyne
Udin
Devi
Putra
Kelvin
Gilang
Siska
Udin
Gilang
Silalahi
Delvin
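
The output above still contains the Staff entries and repeated names. A minimal sketch of the same split-based idea that also skips a position and drops repeats could look like the following. The function name extract_names and the skip_positions parameter are made up for illustration, and it assumes every data line has exactly the form Position.Number.Name.

```python
def extract_names(lines, skip_positions=("Staff",)):
    """Return names in first-seen order, without skipped positions or repeats."""
    seen = {}
    for line in lines:
        parts = line.strip().split(".")
        if len(parts) != 3:
            continue  # header, blank line, or anything not shaped like a record
        position, _number, name = parts
        if position in skip_positions:
            continue
        seen.setdefault(name, None)  # dicts keep insertion order
    return list(seen)


sample = [
    "position | Staff Number | Name\n",
    "Operational.1252.Devi\n",
    "Staff.1875.Erin\n",
    "Operational.1552.Dave\n",
    "Operational.1052.Devi\n",
    "Admin.9823.Gilang\n",
]
print(extract_names(sample))  # ['Devi', 'Dave', 'Gilang']
```

Checking the number of parts also means the header line no longer has to be sliced off by hand.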

CodePudding user response：

Thank you all, I got a little idea from uncle Google:

import re

filex = input('Input file name : ')
with open(filex, "r") as file:
    st3 = file.read()
pattern = r"\bOperational\.[0-9]{1,4}|\bManagement\.[0-9]{1,4}|\bAdmin\.[0-9]{1,4}|\bStaff\.[0-9]{1,4}"
mod_string = re.sub(pattern, '', st3)
lines = mod_string.split("\n")
# drop lines that are empty or contain only whitespace
non_empty_lines = [line for line in lines if line.strip() != ""]
string_without_empty_lines = ""
for line in non_empty_lines:
    string_without_empty_lines += line + "\n"
words = string_without_empty_lines.split()
# keep the first occurrence of each word, in the original order
print('\n'.join(sorted(set(words), key=words.index)))

And the result comes out like this, without duplicates :D

.Devi  
.Erin  
.Dave  
.Hutri 
.Liana 
.Ricky 
.Oswyne
.Udin
.Putra
.Kelvin
.Siska
.Gilang
.Silalahi
.Delvin

result proof : https://prnt.sc/AcCX7JmlvvyN
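
The blank lines in the original question appear because re.sub removes the matched text but leaves each line's newline behind. One possible way to avoid the cleanup step is to let the pattern consume the whole Staff line including its newline. This is only a sketch: the names STAFF_LINE, PREFIX and clean are made up here, and it assumes Staff lines should be dropped entirely, as in the desired output of the question.

```python
import re

# Match a whole Staff line, including its trailing newline if there is one.
STAFF_LINE = re.compile(r"^Staff\..*\n?", re.M)
# Match only the "Position.Number" prefix of the remaining lines.
PREFIX = re.compile(r"^(?:Operational|Management|Admin)\.[0-9]{1,4}", re.M)


def clean(text):
    text = STAFF_LINE.sub("", text)  # removes the line, so no blank line is left
    return PREFIX.sub("", text)      # leaves ".Name" on each remaining line


sample = (
    "Operational.1252.Devi\n"
    "Staff.1875.Erin\n"
    "Operational.1552.Dave\n"
    "Staff.1875.Hutri\n"
    "Admin.9823.Gilang\n"
)
print(clean(sample), end="")
# .Devi
# .Dave
# .Gilang
```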

CodePudding user response：

Instead of using re.sub and splitting the lines, you could also use a specific match with a capture group.

Looking at your provided answer, you can shorten the pattern to:

^(?:Operational|Management|Admin|Staff)\.[0-9]{1,4}(\..+)

In parts, the pattern matches:

^ Start of string (with re.M, the start of each line)
(?: Non capture group
- Operational|Management|Admin|Staff Match one of the alternatives
) Close non capture group
\.[0-9]{1,4} Match . and 1-4 digits 0-9
(\..+) Capture group 1, match a . and then any character 1 or more times

See a regex demo and a Python demo.

For example:

import re

filex = input('Input file name : ')
with open(filex, "r") as file:
    st3 = file.read()
pattern = r"^(?:Operational|Management|Admin|Staff)\.[0-9]{1,4}(\..+)"

result = sorted(set(re.findall(pattern, st3, re.M)))
print(result)

Output of the sorted set; re.findall returns the values of capture group 1:

[
 '.Dave',
 '.Delvin',
 '.Devi',
 '.Erin',
 '.Gilang',
 '.Hutri',
 '.Kelvin',
 '.Liana',
 '.Oswyne',
 '.Putra',
 '.Ricky',
 '.Silalahi',
 '.Siska',
 '.Udin'
]
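
Note that sorted(set(...)) returns the names in alphabetical order. If the order of first appearance should be kept instead, a small variation is to deduplicate with dict.fromkeys. The sketch below uses an inline sample instead of a file, and the names PATTERN and unique_names are only illustrative.

```python
import re

PATTERN = re.compile(
    r"^(?:Operational|Management|Admin|Staff)\.[0-9]{1,4}(\..+)", re.M
)


def unique_names(text):
    # dict.fromkeys keeps only the first occurrence of each key, in order
    return list(dict.fromkeys(PATTERN.findall(text)))


sample = (
    "position | Staff Number | Name\n"
    "Operational.1252.Devi\n"
    "Staff.1875.Erin\n"
    "Operational.1552.Dave\n"
    "Operational.1052.Devi\n"
)
print(unique_names(sample))  # ['.Devi', '.Erin', '.Dave']
```

The header line is skipped automatically because it does not start with one of the listed positions.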