Home > Net >  Read CSV file with Pandas: Regex delimiter
Read CSV file with Pandas: Regex delimiter

Time:10-15

I am facing an issue while trying to find the right regex expression for the read_csv delimiter. My initial txt data looks something like this.

t = '''
[21.01.22, 07:32:11] text1
text2
[21.01.22, 07:34:18] text3
[21.01.22, 07:32:51] text4
text5
'''

I need to separate the rows by the newline and the square bracket expression, so that the desired outcome looks like this

column 1 | column2
[21.01.22, 07:32:11] | text1 text2
[21.01.22, 07:34:18] | text3
[21.01.22, 07:32:51] | text4 text5

The problem I am currently struggling to solve is that some lines contain strings without sqaure brackets. The text inside the square brackets is always of the same format: [dd.mm.yy, hh:mm:ss]

Can you please help me find the right regex expression for the delimiter parameter?

data = pd.read_csv('t.txt', delimiter=r"\[(..................)\]", header=None, engine="python")

CodePudding user response:

Try (regex101):

import re
import pandas as pd

t = """
[21.01.22, 07:32:11] text1
text2
[21.01.22, 07:34:18] text3
[21.01.22, 07:32:51] text4
text5
"""

df = pd.DataFrame(
    re.findall(r"^(\[[^]] \])(.*?)(?=^\[|\Z)", t, flags=re.S | re.M),
    columns=["Column1", "Column2"],
)
df["Column2"] = df["Column2"].str.replace("\n", " ").str.strip()
print(df)

Prints:

                Column1      Column2
0  [21.01.22, 07:32:11]  text1 text2
1  [21.01.22, 07:34:18]        text3
2  [21.01.22, 07:32:51]  text4 text5

CodePudding user response:

maynot be elegant, but seems to work

# readin the file
lines=''
with open("c:\csv2.txt") as fi:  
    line=fi.read()
    lines  = line

#replace newline with space, so that we have a single string
lines=re.sub(r'(\n) ',' ', lines)

# add few delimiters to help split up the lines at set locations
# workaround: add | delimiter before [
lines=re.sub(r'( \[) ','|[', lines)

#workaround: add ; delimiter after ]
lines=re.sub(r'(\] ) ','];', lines)

# create a dataframe by splitting on | delimiter
df1=pd.DataFrame(lines.split('|'))

# split again on ; delimiter and create new columns
df1[['column1','columns2']]= df1[0].str.split(";", expand=True) 

# drop the originally read-in column
df1.drop(columns=[0], inplace=True)
df1

    column1                 columns2
0   [21.01.22, 07:32:11]    text1 text2
1   [21.01.22, 07:34:18]    text3
2   [21.01.22, 07:32:51]    text4 text5
  • Related