Optimising a loop over a large dataset in Python


I have a dataframe with 15,000 rows of binary data, each string being 365 characters long. I want to convert each binary string into 365 daily rows, starting from 13/12/2020.

Because the data is so large, my program runs very slowly. Is there a way I can optimise it?

Data example:

ID Nature Binary
1122 M 1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100

Output:

ID Nature Date Code
1122 M 13/12/2020 1
1122 M 14/12/2020 0
1122 M .......... ...
1122 M 11/12/2021 0

Code:

import pandas as pd

start_date = '2020-12-13'

# Build a table for the first row, then append one table per remaining row
table_ = pd.DataFrame({'ID': df.id[0], 'Nature': df.Nature[0],
                       'Date': pd.date_range(start_date, periods=len(df.binairy[0]), freq='D'),
                       'Code': list(df.binairy[0])})

for i in range(1, len(df)):
    table_i = pd.DataFrame({'ID': df.id[i], 'Nature': df.Nature[i],
                            'Date': pd.date_range(start_date, periods=len(df.binairy[i]), freq='D'),
                            'Code': list(df.binairy[i])})

    table_ = pd.concat([table_, table_i], ignore_index=True)

table_

CodePudding user response:

The best way to optimise the computation time is to parallelise the process, using the multiprocessing library if you have multiple cores, and/or multithreading (I am guessing you are running in a CPU-based environment).
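A minimal sketch of that idea, assuming df is the question's dataframe with the id, Nature and binairy columns; the worker function name expand_row is made up for illustration:

import pandas as pd
from multiprocessing import Pool

start_date = '2020-12-13'

def expand_row(row):
    # Build the per-ID table for a single (id, nature, binary string) tuple
    id_, nature, binary = row
    return pd.DataFrame({'ID': id_,
                         'Nature': nature,
                         'Date': pd.date_range(start_date, periods=len(binary), freq='D'),
                         'Code': list(binary)})

if __name__ == '__main__':
    # df is the source dataframe from the question
    rows = list(df[['id', 'Nature', 'binairy']].itertuples(index=False, name=None))
    with Pool() as pool:  # one worker process per available core
        parts = pool.map(expand_row, rows)
    table_ = pd.concat(parts, ignore_index=True)

Whether this pays off depends on where the time actually goes: the per-row DataFrame construction is parallelised, but the final concat still runs once in the parent process, so it is worth profiling first.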

CodePudding user response:

Do you have to process the data in the dataframe, or can you load it into a database?

You can split the strings of 1s and 0s into rows with dates using a number table. For this implementation I borrowed the number-table generator from this answer on SO, so the following assumes you have those views defined.

Create a table to hold your source data

create table sample_data (
    id int,
    nature char(1),
    bin_str varchar(365)
);

For testing I loaded 2500 rows by duplicating a row

insert sample_data(id, nature, bin_str) values (1,'M','1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100');

insert sample_data(id, nature, bin_str)
select n, nature, bin_str
from sample_data s join generator_4k g
where g.n>1 and g.n<=2500;

Then split the binary string and add dates

select id,
       nature,
       date_add('2020-12-13', INTERVAL n - 1 DAY) date,
       substring(bin_str, n, 1)                   code
from generator_4k
         join sample_data
where generator_4k.n > 0 and generator_4k.n <= length(bin_str)
order by id, n;

id nature date code
1 M 2020-12-13 1
1 M 2020-12-14 0
1 M 2020-12-15 0
. . .......... .
1 M 2021-12-12 0

It took my local machine a few seconds to process the 2,500 rows, so depending on how slow your existing solution is, YMMV.
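If the data starts out in the pandas dataframe from the question, one way to load it into a sample_data table like the one above is DataFrame.to_sql. A minimal sketch, assuming a SQLAlchemy engine pointed at the MySQL database holding the generator views; the connection string and driver are placeholders, not part of the answer:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- point it at the database where the views are defined
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

# df is the source dataframe from the question; rename its columns to match sample_data
(df.rename(columns={'Nature': 'nature', 'binairy': 'bin_str'})
   .to_sql('sample_data', engine, if_exists='append', index=False))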
