How to split 4 million rows of Excel data with Python (help wanted) - CodePudding

My company has about 4 million rows of data spread across a number of Excel tables, each with several hundred thousand rows. These tables now need to be split into Excel files of 1,000 rows each and imported into a third-party CRM on DingTalk. How can this be done quickly and efficiently? Any pointers would be appreciated. If the forum points offered are not enough, I can pay via WeChat.

CodePudding user response:

Outsiders don't know how your data is classified, so we can't help with that part,
and you can't tell us without leaking company data.
Python can read from a database.
Use LIMIT to control how many rows are read at a time.
Then write the Excel files according to the data's structure.
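
A minimal sketch of the paging idea above, assuming the data sits in a SQL database. It uses Python's built-in sqlite3 module; the `customers` table and its columns are hypothetical stand-ins, and each page yielded here would then be written to its own Excel file.

```python
import sqlite3


def iter_pages(conn, page_size=1000):
    """Yield lists of rows, at most page_size rows each, using LIMIT/OFFSET."""
    offset = 0
    while True:
        # Name the columns explicitly; the table and columns are hypothetical.
        rows = conn.execute(
            "SELECT id, name FROM customers ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += page_size


# Demo with an in-memory database holding 2500 hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, "name%d" % i) for i in range(2500)],
)
pages = list(iter_pages(conn))
print([len(p) for p in pages])  # [1000, 1000, 500]
```

The ORDER BY clause keeps the pages stable between queries; without it the database is free to return rows in any order.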

CodePudding user response:

Replying to the 1st floor:

I have never used Python. Is this achievable?

CodePudding user response:

Replying to weixin_45014606 on the 2nd floor:

It is certainly possible, but I don't know how your database was designed. Is there a database server? Does it support read/write separation, load balancing, and so on? To be fast and efficient, you first have to consider database query efficiency:
For example, avoid SELECT * queries and select only the fields you need, such as SELECT field1, field2, ...
With limited hardware resources, you can consider processing with multiple threads.
The key is to split the work in a way that makes sense for your business.
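
One way the multithreading suggestion could look, sketched with the standard library's concurrent.futures. The `export_table` worker and the table names are hypothetical placeholders; in real use the worker would run the query and write the output files. Threads mostly help when the work spends its time waiting on I/O such as database or disk access.

```python
from concurrent.futures import ThreadPoolExecutor


def export_table(table_name):
    """Hypothetical worker: export one table and report what was done."""
    # Placeholder for the real work (query the table, write the output files).
    return "%s: done" % table_name


def export_all(table_names, max_workers=4):
    """Run export_table over all tables using a small pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input names.
        return list(pool.map(export_table, table_names))


results = export_all(["tbl_a", "tbl_b", "tbl_c"])
print(results)
```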

CodePudding user response:

I have done similar work: generating CSV files to load into Hadoop.

I wrote the table names into a text file, and the script exported table by table, starting a new CSV file every 10 million rows.
The result was a file list like tbl_a_1.csv, tbl_a_2.csv, and so on.

The job ran every day, and over 10 days it loaded three terabytes of data.
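
A rough sketch of that export pattern using only the standard library. The `write_csv_chunks` helper, the demo rows, and the small chunk size are assumptions made for illustration; only the tbl_a_1.csv naming scheme comes from the description above.

```python
import csv
import itertools
import os
import tempfile


def write_csv_chunks(rows, out_dir, prefix, rows_per_file):
    """Write rows to prefix_1.csv, prefix_2.csv, ... and return the paths."""
    paths = []
    rows = iter(rows)
    for part in itertools.count(1):
        # Take the next block of rows; an empty block means we are done.
        chunk = list(itertools.islice(rows, rows_per_file))
        if not chunk:
            break
        path = os.path.join(out_dir, "%s_%d.csv" % (prefix, part))
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(chunk)
        paths.append(path)
    return paths


# Demo: 25 hypothetical rows, 10 rows per file, written to a temp directory.
demo_dir = tempfile.mkdtemp()
demo_rows = ([i, "value%d" % i] for i in range(25))
paths = write_csv_chunks(demo_rows, demo_dir, "tbl_a", 10)
print([os.path.basename(p) for p in paths])
```

Each block is held in memory before it is written, so with very large files it may be better to write row by row. No header row is written here; add one per file if the importing tool expects it.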

CodePudding user response:

The data volume is not large. With pandas it should take little more than a dozen lines of code, and the result should be out in a few minutes, about as long as it takes to open the Excel file.

CodePudding user response:

Replying to paullbm on the 3rd floor:

There is no server. The data is in Excel tables right now, and the goal is to import it into the CRM software for unified management.

CodePudding user response:

Replying to old coconut on the 4th floor:

I don't have that much data; it's probably about 4 GB. I have been importing it manually for a few days, and the efficiency is too low.

CodePudding user response:

Replying to ice of wind on the 5th floor:

I have never used it. Could you show me how?

CodePudding user response:

For a single Excel file:

 
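import os
import pandas as pd

filename = "d:/my_files/ABC.xlsx"
df = pd.read_excel(filename)  # reads the first sheet
row_num = df.shape[0]
# Write each block of 1000 rows to its own file: 0_ABC.xlsx, 1_ABC.xlsx, ...
# The output directory must already exist.
for i in range(0, row_num, 1000):
    j = i + 1000 if i + 1000 < row_num else row_num
    out_name = 'd:/my_convert_files/%s_%s' % (i // 1000, os.path.basename(filename))
    df[i:j].to_excel(out_name, index=False)  # index=False leaves out the pandas index column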

For a set of files in a directory:

 
import os
import pandas as pd

def split_file(fullfilename):
    # Split one Excel file into files of at most 1000 rows each.
    df = pd.read_excel(fullfilename)
    row_num = df.shape[0]
    for i in range(0, row_num, 1000):
        j = i + 1000 if i + 1000 < row_num else row_num
        out_name = 'd:/my_convert_files/%s_%s' % (i // 1000, os.path.basename(fullfilename))
        df[i:j].to_excel(out_name, index=False)  # index=False leaves out the pandas index column

folder = "d:/my_files"
files = os.listdir(folder)
for filename in files:
    split_file(os.path.join(folder, filename))

I'm not sure whether I understood the question correctly.

CodePudding user response:

Read the data with pandas and then split it up; that is all it takes.
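
The splitting step itself comes down to computing row ranges. A small sketch of that arithmetic, where the `chunk_bounds` helper is a hypothetical name introduced here for illustration:

```python
def chunk_bounds(row_count, chunk_size=1000):
    """Return (start, stop) row ranges that cover row_count rows."""
    return [
        (start, min(start + chunk_size, row_count))
        for start in range(0, row_count, chunk_size)
    ]


bounds = chunk_bounds(2500)
print(bounds)                        # [(0, 1000), (1000, 2000), (2000, 2500)]
print(len(chunk_bounds(4_000_000)))  # 4000
```

With a pandas DataFrame df, each pair would select df.iloc[start:stop]. The figure of 4,000 output files assumes all 4 million rows sit in one table; split across several tables, the count can be slightly higher because each table may end with a partial chunk.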