Home > Software design >  How to parse the log data which is in form of nested [key=value] format using python pandas
How to parse the log data which is in form of nested [key=value] format using python pandas

Time:10-29

I have huge Sensor log data which is in form of [key=value] pair I need to parse the data column wise i found this code for my problem

import pandas as pd

lines = []
with open('/path/to/test.txt', 'r') as infile:
    for line in infile:
        if "," not in line:
            continue
        else:
            lines.append(line.strip().split(","))

row_names = []
column_data = {}

max_length = max(*[len(line) for line in lines])

for line in lines:
    while(len(line) < max_length):
        line.append(f'{len(line)-1}=NaN')

for line in lines:
    row_names.append(" ".join(line[:2]))
    for info in line[2:]:
        (k,v) = info.split("=")
        if k in column_data:
            column_data[k].append(v)
        else:
            column_data[k] = [v]

df = pd.DataFrame(column_data)
df.index = row_names
print(df)

df.to_csv('/path/to/test.csv')

the above code is suitable when the data is in form of "Priority=0, X=776517049" but my data is something like this [Priority=0][X=776517049] and there is no separator in between two columns how can i do it in python and i am sharing the link of sample data here raw data and bilow that expected parsed data which i done manually https://docs.google.com/spreadsheets/d/1EVTVL8RAkrSHhZO48xV1uEGqOzChQVf4xt7mHkTcqzs/edit?usp=sharing kindly check this link

CodePudding user response:

I've downloaded as csv.

Since your file has multiple tables on one sheet, I've limited to 100 rows, you can remove that parameter.

raw = pd.read_csv(
    "logdata - Sheet1.csv",  # filename 
    skiprows=1,  # skip the first row
    nrows=100,  # use 100 rows, remove in your example
    usecols=[0],  # only use the first column
    header=None  # your dataset has no column names
)

Then you can use a regex to extract the values:

df = raw[0].str.extract(r"\[Priority=(\d*)\] \[GPS element=\[X=(\d*)\] \[Y=(\d*)\] \[Speed=(\d*)\]")

and set column names:

df.columns = ["Priority", "X", "Y", "Speed"]

result:

  Priority          X          Y Speed
0         0  776517049  128887449     4
1         0  776516816  128887733     0
2         0  776516816  128887733     0
3         0  776516833  128887166     0
4         0  776517200  128886133     0
5         0  776516883  128885933     8
..................................... 
99        0  776494483  128908783     0
  • Related