Situation: I have a flat file of data with various elements in it and I need to extract specific portions. I am a beginner in Python and wrote it out using Regular Expressions and other functions. Here is a sample of the data from the txt file I receive:
**ACCESSORID = FS01234** TYPE = USER SIZE = 1024 BYTES
**NAME = JOHN SMITH** FACILITY = TSO
DEPT ACID = D12RGRD DEPARTMENT = TRAINING
DIV ACID = NR DIVISION = NRE
CREATED = 01/17/05 00:00 LAST MOD = 11/16/21 10:42
**PROFILES = VPSNRE P11NR00A**
LAST USED = 12/02/21 09:03 CPU(SYSB) FAC(SUPRSESS) COUNT(06051)
**XA SSN = 123456789** OWNER(JB112)
XA TSOACCT = 123456789 OWNER(JB112 )
XA TSOAUTH = JCL OWNER(JB112 )
XA TSOAUTH = RECOVER OWNER(JB112 )
XA TSOPROC = NR005PROC OWNER(JB112 )
----------- SEGMENT TSO
TRBA = NON-DISPLAY FIELD
TSOCOMMAND =
TSODEFPRFG =
TSOLACCT = 111111111
TSOLPROC = NR9923PROC
TSOLSIZE = 0004096
TSOOPT = MAIL,NONOTICES,NOOIDCARD
TSOUDATA = 0000
TSOUNIT = SYSDD
TUPT = NON-DISPLAY FIELD
----------- SEGMENT USER
**EMAIL ADDR = [email protected]**
The portions I need to extract are bolded (marked with ** above). Rather than post my entire script, here is what I am doing so far to extract the ACCESSORID = FS01234 and NAME = JOHN SMITH portions.
def RemoveSpace():
    # Collapse every run of whitespace (including newlines) into a single space.
    with open("PROJECTFILE.txt", "r") as f, open("RemoveSpace.txt", "w") as f1:
        data1 = f.read()
        word = data1.split()
        s = ' '.join(word)
        f1.write(s)
    print("Data Written Successfully")

RemoveSpace()
import re

f = open(r"C:\Users\user\Desktop\HR\PROJECTFILE\RemoveSpace.txt", "r").read()
TSS = []
# Split into one chunk per record; the first chunk precedes any ACCESSORID.
contents = re.split(r"ACCESSORID =", f)
contents.pop(0)
for item in contents:
    TSS_DICT = {}
    emplid = re.search(r"FS.*", item)
    if emplid is not None:
        s_emplid = re.search(r"FS\w*", emplid.group())
    else:
        s_emplid = None
    if s_emplid is not None:
        s_emplid = s_emplid.group()
    else:
        s_emplid = None
    TSS_DICT["EMPLOYEE ID"] = s_emplid
    name = re.search(r"NAME =.*", item)
    if name is not None:
        emp_name = re.search("[^NAME = ][^,]*", name.group())
    else:
        emp_name = None
    if emp_name is not None:
        emp_name = emp_name.group()
    else:
        emp_name = None
    TSS_DICT["EMPLOYEE NAME"] = emp_name
Question: I am having difficulty getting JOHN SMITH on its own. The match keeps bringing in everything after JOHN SMITH, down to the very last line with the email address. My end goal is a CSV file with each bolded item as its own column. More directly, how would experts approach this data clean-up to simplify the process? If needed I can post the full code, but I didn't want to muddle this up any more than necessary.
Answer:
For practising your Regex, I recommend using a website like RegExr. Here, you can paste the text that you want to match and you can play around with different matching expressions to get the result that you intend.
Assuming that you want to use this code for multiple files of the same organisation and that the data is formatted the same way in each, you can simplify your code a lot.
Let's say we wanted to extract NAME = JOHN SMITH from the text file. We could write the following Python code to do this:
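import re
# \w+ matches one or more word characters, so this matches "NAME = JOHN SMITH".
pattern = r"NAME = \w+ \w+"
# Take the first match and slice off the 7-character "NAME = " prefix.
name = re.findall(pattern, text_to_search)[0][7:]
print(name)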
pattern is our Regex search expression. text_to_search is your text file that you have read into your Python script. re.findall() returns a list of matched items, and we access the first item in that list with [0]. We can then use string slicing ([7:]) to remove the "NAME = " bit.
The above code would output the following:
JOHN SMITH
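As a side note on why the pattern in the question runs past the name: once RemoveSpace() has joined the whole file onto a single line, there are no newlines left to stop a greedy ".*". Also, "[^NAME = ]" is a character class (any one character that is not N, A, M, E, a space or =), not "skip the text NAME = ". A minimal sketch of one alternative, assuming FACILITY always follows NAME on the same record as it does in the sample, is to use a capture group with a lazy match. The shortened sample string below is only for illustration:

```python
import re

# Shortened sample record after whitespace has been collapsed onto one line.
text_to_search = (
    "ACCESSORID = FS01234 TYPE = USER SIZE = 1024 BYTES "
    "NAME = JOHN SMITH FACILITY = TSO DEPT ACID = D12RGRD"
)

# The approach from the question: ".*" is greedy and nothing stops it,
# so it matches through to the end of the text.
greedy = re.search(r"NAME =.*", text_to_search).group()

# Capture group plus lazy match: stop as soon as the next label is reached.
# Assumption: the FACILITY label always follows the name.
match = re.search(r"NAME = (.*?) FACILITY =", text_to_search)
name = match.group(1) if match else None

print(greedy)
print(name)
```

Because the name is taken from group(1), no slicing is needed, and names with more than two words are captured as well.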
You should be able to apply the same principles to the other bold sections of your text file.
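To illustrate applying this to all of the bold fields, here is a rough sketch that works on the original file rather than the whitespace-collapsed copy, so that line boundaries can help delimit values. The names FIELD_PATTERNS and parse_records are made up for this example, the sample text is a cut-down hypothetical with two records, and the email addresses are placeholders. It also assumes that each field starts its own line and that the ** markers are not present in the real file:

```python
import re

# Hypothetical two-record sample in the same layout as the question's data.
# The email address is a placeholder, not a value from the original file.
sample = """\
ACCESSORID = FS01234 TYPE = USER SIZE = 1024 BYTES
NAME = JOHN SMITH FACILITY = TSO
PROFILES = VPSNRE P11NR00A
XA SSN = 123456789 OWNER(JB112)
----------- SEGMENT USER
EMAIL ADDR = jsmith@example.com
ACCESSORID = FS05678 TYPE = USER SIZE = 1024 BYTES
NAME = MARY ANN JONES FACILITY = TSO
PROFILES = VPSNRE
XA SSN = 987654321 OWNER(JB112)
"""

# One pattern per output column; each captures the value in group 1.
FIELD_PATTERNS = {
    "EMPLOYEE NAME": re.compile(r"^\s*NAME\s*=\s*(.+?)\s+FACILITY\s*=", re.MULTILINE),
    "PROFILES": re.compile(r"^\s*PROFILES\s*=\s*(.+?)\s*$", re.MULTILINE),
    "SSN": re.compile(r"^\s*XA SSN\s*=\s*(\d+)", re.MULTILINE),
    "EMAIL": re.compile(r"^\s*EMAIL ADDR\s*=\s*(\S+)", re.MULTILINE),
}

def parse_records(text):
    """Return one dict per ACCESSORID record; missing fields become None."""
    records = []
    # Everything before the first ACCESSORID is discarded by [1:].
    for chunk in re.split(r"ACCESSORID\s*=", text)[1:]:
        # The ID is the first whitespace-delimited token of the chunk.
        record = {"EMPLOYEE ID": chunk.split()[0]}
        for column, pattern in FIELD_PATTERNS.items():
            match = pattern.search(chunk)
            record[column] = match.group(1) if match else None
        records.append(record)
    return records

records = parse_records(sample)
for record in records:
    print(record)
```

Keeping the patterns in one dictionary means adding another column is a one-line change, and a record that lacks a field simply gets None instead of raising an error.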
In terms of writing your extracted data out to a CSV file, it is probably worth reading a good tutorial on this, for example "Reading and Writing CSV Files in Python". There are a few different ways of storing your information before writing, such as lists vs dictionaries. But you can write CSV files either with built-in Python tools or manually.
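As one possible way to do the CSV step with the built-in tools, here is a small sketch using csv.DictWriter from the standard library. It assumes the extracted data is a list of dictionaries that all share the same keys. The records list, the write_records helper and the output.csv file name are made up for illustration:

```python
import csv
import io

# Hypothetical extracted data: one dictionary per record, same keys in each.
records = [
    {"EMPLOYEE ID": "FS01234", "EMPLOYEE NAME": "JOHN SMITH", "SSN": "123456789"},
    {"EMPLOYEE ID": "FS05678", "EMPLOYEE NAME": "MARY ANN JONES", "SSN": None},
]

def write_records(records, fileobj):
    """Write a header row, then one CSV row per record."""
    writer = csv.DictWriter(fileobj, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)  # None values are written as empty cells

# Write to an in-memory buffer here so the result can be printed.
buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())

# To write a real file, open it with newline="" as the csv module recommends:
# with open("output.csv", "w", newline="") as f:
#     write_records(records, f)
```

Each dictionary key becomes a column header, which matches the goal of having each extracted item in its own column.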