Optimise Code to improve performance and reduce Execution time


I have code that works correctly, but when I run it on a large CSV file (around 2 GB) the complete execution takes about 15-20 minutes. Is there a way to optimise the code below so that it finishes faster?

from csv import reader, writer
import pandas as pd

path = (r"data.csv")

data = pd.read_csv(path, header=None)

last_column = data.iloc[: , -1]

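# indices of rows immediately after the last column drops from 1 to 0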
arr = [i + 1 for i in range(len(last_column) - 1) if (last_column[i] == 1 and last_column[i + 1] == 0)]

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open(path, 'r') as read_obj:
    csv_reader = reader(read_obj)
    rows = list(csv_reader)

for j in arr:
    
    # Channel 1-7
    ch_0_6_init = [int(rows[j][k]) for k in range(1,8)]
    bin_num = ''.join([str(x) for x in ch_0_6_init])
    dec_num = int(f'{bin_num}', 2)
    ch_0_6.append(dec_num)
    ch_0_6_init = []

    # Channel 8-15
    ch_7_14_init = [int(rows[j][k]) for k in range(8,16)]
    bin_num = ''.join([str(x) for x in ch_7_14_init])
    dec_num = int(f'{bin_num}', 2)
    ch_7_14.append(dec_num)
    ch_7_14_init = []

    # Channel 16-22
    ch_16_22_init = [int(rows[j][k]) for k in range(16,23)]
    bin_num = ''.join([str(x) for x in ch_16_22_init])
    dec_num = int(f'{bin_num}', 2)
    ch_16_22.append(dec_num)
    ch_16_22_init = []

Sample Data:

0.0114,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,1
0.0112,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0
0.0115,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1
0.0117,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,0
0.0118,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,1,1,0,1,0,0,0,1

The binary digits of the chosen channel group are joined and the resulting bit string is converted to a decimal number, as in the worked example below.
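
For illustration (a quick check of the conversion, not part of the original code): the second sample row is the first one whose last column drops from 1 to 0, so it is one of the rows that gets decoded. Splitting it on commas and joining each channel group gives:

row = '0.0112,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0'.split(',')
print(int(''.join(row[1:8]), 2))    # channels 1-7:   '0100000'  -> 32
print(int(''.join(row[8:16]), 2))   # channels 8-15:  '00000001' -> 1
print(int(''.join(row[16:23]), 2))  # channels 16-22: '1101000'  -> 104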

CodePudding user response:

Using just the csv module, you could try the following approach:

from csv import reader

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open('data.csv', 'r') as f_input:
    csv_input = reader(f_input)
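    # dummy previous row so the very first data row cannot register a 1 -> 0 transition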
    last_row = ['0']

    for row in csv_input:
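        # previous row ended in 1 and this row ends in 0: decode this row's channel groups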
        if last_row[-1] == '1' and row[-1] == '0':
            ch_0_6.append(int(''.join(row[1:8]), 2))
            ch_7_14.append(int(''.join(row[8:16]), 2))
            ch_16_22.append(int(''.join(row[16:23]), 2))
            
        last_row = row
    
print(ch_0_6)
print(ch_7_14)
print(ch_16_22)

For your example data this would display:

[32, 46]
[1, 145]
[104, 104]

Your original approach read the whole file into memory twice: once with pandas.read_csv and again as a list of rows via the csv reader. The first pass was used only to determine which rows to parse, but that decision can be made while reading by keeping track of the previous row in the loop. This alone should result in a significant speed-up.

The conversion from the binary digits to a decimal value is also a bit more direct: the CSV fields are joined as strings and passed straight to int(..., 2), without first converting each element to int and back to str.
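
Side by side, the two conversion styles look like this (a minimal sketch; row stands for one parsed CSV row, as in both loops above):

# original: convert each field to int, back to str, then join
bits = [int(row[k]) for k in range(1, 8)]
value = int(''.join(str(x) for x in bits), 2)

# streamlined: the csv fields are already strings, so join and parse them directly
value = int(''.join(row[1:8]), 2)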

Because it streams the file one row at a time instead of loading it all into memory, this approach also works on much larger files.
