Python project - Writing contents of .txt file to Pandas dataframe-CodePudding

I'm currently working on a Python project where I want to:

Loop through subdirectories of a root directory
Find .txt files with names starting with 'memory_'. Txt files are: newline-separated, lines consist of: 'colName: Value' pairs. Like this.

Memory dump

Serialnr: 1412b23990

Date/time: 24-11-2016 08:10



mode: version

Hardware release: ic2kkit01*P131113*

Software release: V3.82

Rom test 1 checksum: e0251fda

Rom test 2 checksum: cae0351f



mode: statistics

Line power connected (hours): 360

Line power disconnected (number of times): 2

Ch function(hours): 54

Dhw function(hours): 4

Burnerstarts (number of times): 604

Ignition failed (number of times): 0

Flame lost (number of times): 0

Reset (number of times): 0



mode: status

T1: 17.42

T2: 17.4

T3: 16.38

T4: -35.0

T5: -35.0

T6: 17.4

Temp_set: 0.0

Fanspeed_set: 0.0

Fanspeed: 0.0

Fan_pwm: 0.0

Opentherm: 0

Roomtherm: 0

Tap_switch: 0

Appending the contents of the .txt file to a Pandas data frame with predefined column names. I.e.: I would like to write each .txt file into a data frame row, using the colName:Value pairs. See attached image for how (a part of) the data frame should look like.

Current .py code:

import os
import pandas as pd

# Set rootdir for os.walk
rootdir = 'K:/Retouren' 

## Create empty Pandas dataframe with just column names
column_names = ["Memory dump", "Serialnr", "Date/time", "mode", "Hardware release", "Software release", "Rom test 1 checksum", "Rom test 2 checksum", 
    "mode", "Line power connected (hours)", "Line power disconnected (number of times)", "Ch function(hours)", "Dhw function(hours)", "Burnerstarts (number of times)", 
    "Ignition failed (number of times)", "Flame lost (number of times)", "Reset (number of times)", "Gasmeter_ch", "Gasmeter_dhw", "Watermeter", "Burnerstarts_dhw", 
    "mode", "T1", "T2", "T3", "T4", "T5", "T6", "Temp_set", "Fanspeed_set", "Fanspeed", "Fan_pwm", "Opentherm", "Roomtherm", "Tap_switch", "Gp_switch", "Pump", "Dwk", 
    "Gasvalve", "Io_signal", "Spark", "Io_curr", "Displ_code", "Ch_pressure", "Rf_rth_bound", "Rf_rth_communication", "Rf_rth_battery_info", "Rf_rth_battery_ok", 
    "Bc_tapflow", "Pump_pwm", "Room_override_zone1", "Room_set_zone1", "Room_temp_zone1", "Room_override_zone2", "Room_set_zone2", "Room_temp_zone2", "Outside_temp", 
    "Ot_master_member_id", "Ot_therm_prod_version", "Ot_therm_prod_type", "mode", "Node nr", "Cloud id0", "Cloud id1", "Cloud id2", "Rf cloud nr", "Rssi_lower", 
    "Rssi_upper", "Rssi_wait", "Attention_period", "Attention_number", "Info10", "Info11", "Info12", "Info13", "Info14", "Info15", "Info16", "Info17", "Info18", 
    "Ramses_thermostat_idh", "Ramses_thermostat_idm", "Ramses_thermostat_idl", "Ramses_boiler_idh", "Ramses_boiler_idm", "Ramses_boiler_idl", "Prod. token", 
    "Year", "Month", "Line number", "Serial1", "Serial2", "Serial3", "mode", "Id_dongle0", "Id_dongle1", "Id_dongle2", "Id_dongle3", "Id_lan0", "Id_lan1", 
    "Id_lan2", "Id_lan3", "Info2_7", "Info2_8", "Info2_9", "Info2_10", "Info2_11", "Info2_12", "Info2_13", "Info2_14", "mode", "Interrupt_time", 
    "Interrupt_load (%)", "Main_load (%)", "Net fequency (hz)", "Voltage ref. (v)", "Checksum1", "Checksum2", "nmode", "Fault 0", "Fault 1", "Fault 2", 
    "Fault 3", "Fault 4", "Fault 5", "Fault 6", "Fault 7", "Fault 8", "Fault 9", "Fault 10", "Fault 11", "Fault 12", "Fault 13", "Fault 14", "Fault 15", 
    "Fault 16", "Fault 17", "Fault 18", "Fault 19", "Fault 20", "Fault 21", "Fault 22", "Fault 23", "Fault 24", "Fault 25", "Fault 26", "Fault 27", "Fault 28", 
    "Fault 29", "Fault 30", "Fault 31", "mode", "Heater_on", "Comfort_mode", "Ch_set_max", "Dhw_set", "Eco_days", "Comfort_set", "Dhw_at_night", "Ch_at_night", 
    "Parameter 1", "Parameter 2", "Parameter 3", "Parameter 4", "Parameter 5", "Parameter 6", "Parameter 7", "Parameter 8", "Parameter 9", "Parameter a", 
    "Parameter b", "Parameter c", "Parameter c", "Parameter d", "Parameter e", "Parameter e.", "Parameter f", "Parameter h", "Parameter n", "Parameter o", 
    "Parameter p", "Parameter r", "Parameter f.", "mode", "Param31", "Param32", "Param33", "Param34", "Param35", "Param36", "Param37", "Param38", "Param39", 
    "Param40", "Param41", "Param42", "Param43", "Param44", "Param45", "Param46", "Param47", "Param48", "Param49", "Param50", "Param51", "Param52", "Param53", 
    "Param54", "Param55", "Param56", "Param57", "Param58", "Param59", "Param60", "Param61", "Param62", "Param63"]
data = pd.DataFrame(columns = column_names)

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.startswith('memory_') and os.path.splitext(file)[1] == '.txt':
            filepath = os.path.join(subdir, file)
            with open (filepath, "r") as curfile:
                data.append()  
                ## Here is where I would like to append the .txt data as a row in the data frame

I have the first two steps down, but the third is exceeding my programming knowledge. Any tips would be greatly appreciated.

Example of the desired dataframe:

Example of the desired dataframe

CodePudding user response：

I suggest reading the file with readlines(), which will return a list of lines. Then loop over the lines and process only those that contain : in the string. Split by the colon (and trailing whitespace) while wrapping everything in dict() will create a dictionary with the strings before the colon as keys and the strings after the colons as values:

dict(i.split(': ',1) for i in curfile.readlines() if ':' in i)

for your sample data this would make:

{'Serialnr': '1412b23990', 'Date/time': '24-11-2016 08:10', 'mode': 'status', 'Hardware release': 'ic2kkit01*P131113*', 'Software release': 'V3.82', 'Rom test 1 checksum': 'e0251fda', 'Rom test 2 checksum': 'cae0351f', 'Line power connected (hours)': '360', 'Line power disconnected (number of times)': '2', 'Ch function(hours)': '54', 'Dhw function(hours)': '4', 'Burnerstarts (number of times)': '604', 'Ignition failed (number of times)': '0', 'Flame lost (number of times)': '0', 'Reset (number of times)': '0', 'T1': '17.42', 'T2': '17.4', 'T3': '16.38', 'T4': '-35.0', 'T5': '-35.0', 'T6': '17.4', 'Temp_set': '0.0', 'Fanspeed_set': '0.0', 'Fanspeed': '0.0', 'Fan_pwm': '0.0', 'Opentherm': '0', 'Roomtherm': '0', 'Tap_switch': '0'}

If you create an empty list before the loop, and append the dictionaries to that list within the loop, you'll end up with a list of dicts that you can load with pandas in one go:

import os
import pandas as pd

# Set rootdir for os.walk
rootdir = 'K:/Retouren' 

## Create empty list
data = []

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.startswith('memory_') and os.path.splitext(file)[1] == '.txt':
            filepath = os.path.join(subdir, file)
            with open (filepath, "r") as curfile:
                data.append(dict(i.split(': ',1) for i in curfile.readlines() if ':' in i))  
            
df = pd.DataFrame(data)

An added advantage is that you don't need to set the column names manually, because pandas will use the dict keys for that. DataFrame:

	Serialnr	Date/time	mode	Hardware release	Software release	Rom test 1 checksum	Rom test 2 checksum	Line power connected (hours)	Line power disconnected (number of times)	Ch function(hours)	Dhw function(hours)	Burnerstarts (number of times)	Ignition failed (number of times)	Flame lost (number of times)	Reset (number of times)	T1	T2	T3	T4	T5	T6	Temp_set	Fanspeed_set	Fanspeed	Fan_pwm	Opentherm	Roomtherm	Tap_switch
0	1412b23990	24-11-2016 08:10	status	ic2kkit01P131113	V3.82	e0251fda	cae0351f	360	2	54	4	604	0	0	0	17.42	17.4	16.38	-35	-35	17.4	0	0	0	0	0	0	0

There is one disadvantage: as a dict can only contain unique keys you will lose two mode values. I'll leave it as they seem to be headers rather than containers of information. Otherwise it would require some additional renaming.

CodePudding user response：

I think you're saying that you can already get your text file into a list of lines - so keeping this super simple, from a list of lines from your text file - let's call it lineList - just do this:
splitList = [l.split(":") for l in lineList if len(l)>0]
then
someDF = pd.DataFrame(splitList, index=[i[0] for i in splitList])[[1]].T
to get

Notes:

lines with multiple colons (e.g. datetimes) will need special handling because you'll get three (or more) elements from your split
using "list comprehensions" here, as they are one of the simplest way of working with lists in Python
Pandas will automatically give columns numerical 'names', so that's why you can ask for [[1]] - it's the second column (the first one we discard because you want that as your index
.T switches your dataframe from columns to rows
there are better, more sophisticated ways of doing this, but keeping this simple
you could handle multi-: values with something like this:
":".join("Date/time: 24-11-2016 08:10".split(":")[1:])
which gives you:
' 24-11-2016 08:10'

In the context of that first line above, you would thus adapt that method like so:

splitList = [[l.split(":")[0],":".join(l.split(":")[1:])] for l in lineList if len(l)>0]

which gives you:

 [['Memory dump', ''],
 ['Serialnr', ' 1412b23990'],
 ['Date/time', ' 24-11-2016 08:10'],
 ['mode', ' version'],
 ['Hardware release', ' ic2kkit01*P131113*'],
 ['Software release', ' V3.82'],
 ['Rom test 1 checksum', ' e0251fda'],
 ['Rom test 2 checksum', ' cae0351f'],
 ['mode', ' statistics'],
 ['Line power connected (hours)', ' 360'],
 ['Line power disconnected (number of times)', ' 2'],
 ['Ch function(hours)', ' 54'],
 ['Dhw function(hours)', ' 4'],
 ['Burnerstarts (number of times)', ' 604'],
 ['Ignition failed (number of times)', ' 0'],
 ['Flame lost (number of times)', ' 0'],
 ['Reset (number of times)', ' 0'],
 ['mode', ' status'],
 ['T1', ' 17.42'],
 ['T2', ' 17.4'],
 ['T3', ' 16.38'],
 ['T4', ' -35.0'],
 ['T5', ' -35.0'],
 ['T6', ' 17.4'],
 ['Temp_set', ' 0.0'],
 ['Fanspeed_set', ' 0.0'],
 ['Fanspeed', ' 0.0'],
 ['Fan_pwm', ' 0.0'],
 ['Opentherm', ' 0'],
 ['Roomtherm', ' 0'],
 ['Tap_switch', ' 0']]

To build out a DataFrame of many lines, you'd append each of the one-row dataframes to a list (e.g. dfList.append(someDF) ), and then use concatenation at the end (e.g. comboDF = pd.concat(dfList) )

So putting that all together - given a list of lines from a text file:

dataFrameList = []
splitList = [[l.split(":")[0],":".join(l.split(":")[1:])] for l in lineList if len(l)>0]
oneRowDF = pd.DataFrame(data = [i[1] for i in splitList], index=[i[0] for i in splitList]).T
dataFrameList.append(oneRowDF)

Then when you've got them all loaded into your list of one-row dataframes:

comboDF = pd.concat(dataFrameList)