I have a txt file with some data I want to clean and export as csv but the format is too messed up . The lines in the txt file are in this format
[email protected]:specialcode | Status - 2022-11-25
[email protected]:anothercode | Status - 2023-08-15
[email protected]:codeworcd | Status - 2036-06-19
and so one
I want to convert the lines to
[email protected] , specialcode , Status , 2022-11-25
[email protected], anothercode , Status , 2023-08-15
[email protected], codeworcd, Status, 2036-06-19
So that i can save the file as csv.
How can I approach such a complex situation? I can loop over the lines and split it with split(‘:’)
but each character is different. So it appears more challenging.
Thanks
CodePudding user response:
With your shown sample please try the following:
import csv, re
with open("file.txt") as fi, open("output.csv", "w") as fo:
writer = csv.writer(fo)
for line in fi:
l = re.split(r':| [|-] ', line.rstrip())
writer.writerow(l)
Result:
[email protected],specialcode,Status,2022-11-25
[email protected],anothercode,Status,2023-08-15
[email protected],codeworcd,Status,2036-06-19
- It assumes the input filename is "file.txt" and the output filename is "output.csv".
- The delimiter is defined as
:| [|-]
. It splits the line on a colon or a sequence of a whitespace, pipe or hyphen, and a whitespace. The important thing is the pipe character and hyphen are surrounded by whitespaces as shown in your sample.
CodePudding user response:
here is one way to do it without regex
assuming that you have a text file in filesystem, that you are or can read using read_csv
# read in the text file, and name the columns, assuming there is only one '|' in file
df=pd.read_csv(r'csv.csv', sep='|', header=None, names=['col1','col2'])
# split col1 on colon
df[['email','code']]=df['col1'].str.split(':', expand=True)
# split only one occurrence on hyphen
df[['status','date']]=df['col2'].str.split('-', 1,expand=True)
# drop the read-in columns
df=df.drop(columns=['col1','col2'])
# write to csv
df.to_csv(r'csvout.csv')
email code status date
0 [email protected] specialcode Status 2022-11-25
1 [email protected] anothercode Status 2023-08-15
2 [email protected] codeworcd Status 2036-06-19