splitting a flatfile using timestamp as new line instead of \n-CodePudding

I have a flatfile and am trying to create a dataframe using read_csv and I'd like to use a timestamp as the spot where a new row starts instead of a new line. I don't want to use \n because when there's an error in the flatfile, the errors take up multiple lines, but they still start with a single timestamp. I tried:

df = pd.read_csv('myfile.file',header=None,sep='\d{4}-\d{2}-\d{2}')

This does a couple of things I don't want it to do - first, it deletes the yyyy-mm-dd out entirely leaving just the time part (I need to keep it), and second it didn't seem to actually split only at the timestamp, it still is splitting it up where it finds \n

Example: normal line:

2022-05-16 hh:mm:ss here's a normal line in the file \n

error line:

2022-05-16 hh:mm:ss here's an error \nhere's error details \nit's a new line even though it's the same error \nbut it just has one timestamp

normal line current output:

hh:mm:ss here's a normal line in the file

normal line desired output:

2022-05-16 hh:mm:ss here's a normal line in the file

error line current output:

hh:mm:ss here's an error
here's error details
it's a new line even though it's the same error
but it just has one timestamp

error line desired output:

2022-05-16 hh:mm:ss here's an error here's error details it's a new line even though it's the same error but it just has one timestamp

CodePudding user response：

One way would be to read in the file and then fix it. This approach finds all the lines that start with timestamps, groups by them and then concatenates all the strings in that group.

The data:

str_io = io.StringIO(
'''2022-05-16 01:mm:ss here's a normal line in the file
2022-05-16 02:mm:ss here's an error
 here's error details
 it's a new line even though it's the same error
 but it just has one timestamp
2022-05-16 03:mm:ss here's a another normal line in the file
2022-05-16 04:mm:ss here's a third normal line in the file'''
)

(You'll have to mess with setting sep= for your use-case). Can no longer do sep='\n' in 1.4.2. You can use a char that doesn't exist in the file - like back-tick - to read entire lines.

df = pd.read_csv(str_io, header=None)
df

0  2022-05-16 01:mm:ss here's a normal line in th...
1                2022-05-16 02:mm:ss here's an error
2                               here's error details
3    it's a new line even though it's the same error
4                      but it just has one timestamp
5  2022-05-16 03:mm:ss here's a another normal li...
6  2022-05-16 04:mm:ss here's a third normal line...

Find the rows that start with a timestamp (you will think of something better for your use-case than "starts with a '2'":

ts_rows = df[0].str.startswith('2')

Then accumulate the error rows text and attach them to the previous timestamped line:

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
    .apply(lambda x: x[0].str.cat(sep=' ')).reset_index(drop=True)

0    2022-05-16 01:mm:ss here's a normal line in th...
1    2022-05-16 02:mm:ss here's an error  here's er...
2    2022-05-16 03:mm:ss here's a another normal li...
3    2022-05-16 04:mm:ss here's a third normal line...

This shows that the error lines were accumulated:

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
    .apply(lambda x: x[0].str.cat(sep=' '))[1]
    
"2022-05-16 02:mm:ss here's an error  here's error details  it's a new line even though it's the same error  but it just has one timestamp"

Then you can break up the lines into columns like ts and message or whatever you need for your specific application.