I am facing an issue while trying to find the right regex expression for the read_csv delimiter. My initial txt data looks something like this.
t = '''
[21.01.22, 07:32:11] text1
text2
[21.01.22, 07:34:18] text3
[21.01.22, 07:32:51] text4
text5
'''
I need to separate the rows by the newline and the square bracket expression, so that the desired outcome looks like this
column 1 | column2
[21.01.22, 07:32:11] | text1 text2
[21.01.22, 07:34:18] | text3
[21.01.22, 07:32:51] | text4 text5
The problem I am currently struggling to solve is that some lines contain strings without sqaure brackets. The text inside the square brackets is always of the same format: [dd.mm.yy, hh:mm:ss]
Can you please help me find the right regex expression for the delimiter parameter?
data = pd.read_csv('t.txt', delimiter=r"\[(..................)\]", header=None, engine="python")
CodePudding user response:
Try (regex101):
import re
import pandas as pd
t = """
[21.01.22, 07:32:11] text1
text2
[21.01.22, 07:34:18] text3
[21.01.22, 07:32:51] text4
text5
"""
df = pd.DataFrame(
re.findall(r"^(\[[^]] \])(.*?)(?=^\[|\Z)", t, flags=re.S | re.M),
columns=["Column1", "Column2"],
)
df["Column2"] = df["Column2"].str.replace("\n", " ").str.strip()
print(df)
Prints:
Column1 Column2
0 [21.01.22, 07:32:11] text1 text2
1 [21.01.22, 07:34:18] text3
2 [21.01.22, 07:32:51] text4 text5
CodePudding user response:
maynot be elegant, but seems to work
# readin the file
lines=''
with open("c:\csv2.txt") as fi:
line=fi.read()
lines = line
#replace newline with space, so that we have a single string
lines=re.sub(r'(\n) ',' ', lines)
# add few delimiters to help split up the lines at set locations
# workaround: add | delimiter before [
lines=re.sub(r'( \[) ','|[', lines)
#workaround: add ; delimiter after ]
lines=re.sub(r'(\] ) ','];', lines)
# create a dataframe by splitting on | delimiter
df1=pd.DataFrame(lines.split('|'))
# split again on ; delimiter and create new columns
df1[['column1','columns2']]= df1[0].str.split(";", expand=True)
# drop the originally read-in column
df1.drop(columns=[0], inplace=True)
df1
column1 columns2
0 [21.01.22, 07:32:11] text1 text2
1 [21.01.22, 07:34:18] text3
2 [21.01.22, 07:32:51] text4 text5