I have a dataframe with 15000 rows of binary data, where each string is 365 characters long. I convert each binary string into 365 daily values, starting from the date 13/12/2020.
Because the data is so large, my program runs very slowly. Is there a way I can optimize it?
Data example:
ID | Nature | Binary |
---|---|---|
1122 | M | 1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100 |
Output:
ID | Nature | Date | Code |
---|---|---|---|
1122 | M | 13/12/2020 | 1 |
1122 | M | 14/12/2020 | 0 |
1122 | M | .......... | ... |
1122 | M | 12/12/2021 | 0 |
Code:
start_date = '2020-12-13'
table_ = pd.DataFrame({'ID': df.id[0], 'Nature': df.Nature[0], 'Date': pd.date_range(start_date, periods=len(df.binairy[0]), freq='D'), 'Code': list(df.binairy[0])})
for i in range(1, len(df)):
    table_i = pd.DataFrame({'ID': df.id[i], 'Nature': df.Nature[i], 'Date': pd.date_range(start_date, periods=len(df.binairy[i]), freq='D'), 'Code': list(df.binairy[i])})
    table_ = pd.concat([table_, table_i], ignore_index=True)
table_
CodePudding user response:
The best way to reduce the computation time is to parallelize your process, using the multiprocessing library if you have multiple cores, and/or multithreading (I am guessing you are running in a CPU-based environment).
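A minimal sketch of that idea, assuming a CPU-bound pandas workload: split the dataframe into chunks, expand each chunk in a worker process, and concatenate the results once at the end. The column names (id, Nature, binairy) follow the question's code; the data.csv load and the worker count of 8 are placeholders, not a tested drop-in.
import pandas as pd
from multiprocessing import Pool

start_date = '2020-12-13'

def expand_chunk(chunk):
    # Build one 365-row frame per input row and concatenate once per chunk,
    # instead of growing one big frame with pd.concat on every iteration.
    parts = [
        pd.DataFrame({
            'ID': row.id,
            'Nature': row.Nature,
            'Date': pd.date_range(start_date, periods=len(row.binairy), freq='D'),
            'Code': list(row.binairy),
        })
        for row in chunk.itertuples(index=False)
    ]
    return pd.concat(parts, ignore_index=True)

if __name__ == '__main__':
    df = pd.read_csv('data.csv')  # placeholder: load your 15000-row dataframe here
    n_workers = 8                 # adjust to your core count
    chunk_size = -(-len(df) // n_workers)
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with Pool(processes=n_workers) as pool:
        table_ = pd.concat(pool.map(expand_chunk, chunks), ignore_index=True)
    print(table_)
Even without multiprocessing, collecting the per-row frames in a list and calling pd.concat once at the end avoids repeatedly copying the growing table_, which is where most of the time in the original loop goes.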
CodePudding user response:
Do you have to process the data in the dataframe, or can you load it into a database?
You can split the strings of 1s and 0s into rows with dates using a numbers table. For this implementation I borrowed the numbers-table generator from this answer on SO, so the following assumes you have those views defined.
Create a table to hold your source data
create table sample_data (
id int,
nature char(1),
bin_str varchar(365)
);
For testing, I loaded 2500 rows by duplicating a single row:
insert sample_data(id, nature, bin_str) values (1,'M','1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100');
insert sample_data(id, nature, bin_str)
select n, nature, bin_str
from sample_data s join generator_4k g
where g.n>1 and g.n<=2500;
Then split the binary string and add dates
select id,
nature,
date_add('2020-12-13', INTERVAL n - 1 DAY) date,
substring(bin_str, n, 1) code
from generator_4k
join sample_data
where generator_4k.n > 0 and generator_4k.n <= length(bin_str)
order by id, n;
id | nature | date | code |
---|---|---|---|
1 | M | 2020-12-13 | 1 |
1 | M | 2020-12-14 | 0 |
1 | M | 2020-12-15 | 0 |
. | . | .......... | . |
1 | M | 2021-12-12 | 0 |
It took my local machine a few seconds to process 25000 rows, so depending on how slow your existing solution is, YMMV.