I have a dataframe with 15000 rows of binary data, where each string is 365 characters long. I convert each binary string into 365 daily values, starting from the date 13/12/2020.
Because the data is so large, my program runs very slowly. Is there a way I can optimize it?
Data example:
ID | Nature | Binary |
---|---|---|
1122 | M | 1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100 |
Output:
ID | Nature | Date | Code |
---|---|---|---|
1122 | M | 13/12/2020 | 1 |
1122 | M | 14/12/2020 | 0 |
1122 | M | .......... | ... |
1122 | M | 12/12/2021 | 0 |
Code:
start_date = '2020-12-13'
table_ = pd.DataFrame({'ID': df.id[0], 'Nature': df.Nature[0], 'Date': pd.date_range(start_date, periods=len(df.binairy[0]), freq='D'), 'Code': list(df.binairy[0])})
for i in range(1, len(df)):
    table_i = pd.DataFrame({'ID': df.id[i], 'Nature': df.Nature[i], 'Date': pd.date_range(start_date, periods=len(df.binairy[i]), freq='D'), 'Code': list(df.binairy[i])})
    table_ = pd.concat([table_, table_i], ignore_index=True)
table_
CodePudding user response:
The best way to reduce the computation time is to parallelize your process, using the multiprocessing library if you have multiple cores, and/or multithreading (I am guessing you are running in a CPU-based environment).
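A minimal sketch of that idea, assuming a CPU-bound pandas workload: split the dataframe into chunks, expand each chunk in a worker process, and concatenate the results once at the end. The column names (id, Nature, binairy) follow the question's code; the data.csv load and the worker count of 8 are placeholders, not a tested drop-in.
import pandas as pd
from multiprocessing import Pool

start_date = '2020-12-13'

def expand_chunk(chunk):
    # Build one 365-row frame per input row and concatenate once per chunk,
    # instead of growing one big frame with pd.concat on every iteration.
    parts = [
        pd.DataFrame({
            'ID': row.id,
            'Nature': row.Nature,
            'Date': pd.date_range(start_date, periods=len(row.binairy), freq='D'),
            'Code': list(row.binairy),
        })
        for row in chunk.itertuples(index=False)
    ]
    return pd.concat(parts, ignore_index=True)

if __name__ == '__main__':
    df = pd.read_csv('data.csv')  # placeholder: load your 15000-row dataframe here
    n_workers = 8                 # adjust to your core count
    chunk_size = -(-len(df) // n_workers)
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with Pool(processes=n_workers) as pool:
        table_ = pd.concat(pool.map(expand_chunk, chunks), ignore_index=True)
    print(table_)
Even without multiprocessing, collecting the per-row frames in a list and calling pd.concat once at the end avoids repeatedly copying the growing table_, which is where most of the time in the original loop goes.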
CodePudding user response:
Do you have to process the data in the dataframe, or can you load it into a database?
You can split the strings of 1s and 0s into rows with dates using a numbers table. For this implementation I borrowed the numbers-table generator from this answer on SO, so the following assumes you have those views defined.
Create a table to hold your source data
create table sample_data (
id int,
nature char(1),
bin_str varchar(365)
);
For testing, I loaded 2500 rows by duplicating a single row:
insert sample_data(id, nature, bin_str) values (1,'M','1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100');
insert sample_data(id, nature, bin_str)
select n, nature, bin_str
from sample_data s join generator_4k g
where g.n>1 and g.n<=2500;
Then split the binary string and add dates
select id,
nature,
date_add('2020-12-13', INTERVAL n - 1 DAY) date,
substring(bin_str, n, 1) code
from generator_4k
join sample_data
where generator_4k.n > 0 and generator_4k.n <= length(bin_str)
order by id, n;
id | nature | date | code |
---|---|---|---|
1 | M | 2020-12-13 | 1 |
1 | M | 2020-12-14 | 0 |
1 | M | 2020-12-15 | 0 |
. | . | .......... | . |
1 | M | 2021-12-12 | 0 |
It took my local machine a few seconds to process 25000 rows, so depending on how slow your existing solution is, YMMV.