Reading and checking files on an FTP server-CodePudding

I have the following requirement:
1. The database holds more than ten million rows. A query across several tables returns the storage path of each file on the FTP service.
2. For each storage path, I need to check whether the file exists on the FTP server. If it does, I need to determine whether the file is damaged, using a check that depends on the file type.
Right now I loop over the ten million rows, request each file from the FTP server as a stream, and then run the check on it. Each iteration takes about 2 seconds, so a single thread would run for months. A multi-threaded run goes out of memory, and even with fewer threads (to avoid running out of memory) it would still run for days. The FTP service also limits the number of connections, so I can only fetch the file stream through a URL, but after it has been running for a while, getting the stream through the URL becomes slower and slower. Does anyone have a good idea that would greatly improve the efficiency?

 
// fullPath is an ftp:// URL
URL url = new URL(new String(fullPath.getBytes("GBK"), "GB2312"));
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();

CodePudding user response:

Run the check on the FTP file server itself and write the results to a file or a database.

CodePudding user response:

How are you checking the files? If the check means downloading every file, it will waste a lot of time, because most of the time goes into the download itself.
The general idea is as follows:
1. The files on the FTP server have their own hash checksums that you can query.
2. The records in the database have the corresponding hashes.
3. Do not download the file; compare the two hash values directly.
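A minimal sketch of step 3, assuming both sides really do have a hash available as a hex string (the thread does not confirm this). The names `HashCheck`, `sha256Hex`, and `matches` are hypothetical, and SHA-256 is an arbitrary choice. `sha256Hex` shows how whichever side produces the hash could compute it in a streaming fashion, without holding the whole file in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashCheck {
    /** Streams the input through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Compares the hash reported by the server with the one stored in the database. */
    static boolean matches(String serverHash, String dbHash) {
        return serverHash != null && serverHash.equalsIgnoreCase(dbHash);
    }
}
```

If the two hashes match, the file never has to be transferred; only mismatches or missing hashes would need a closer look.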

CodePudding user response:

Quoting tianfang's reply on the 1st floor:

Run the check on the FTP file server itself and write the results to a file or a database.

The FTP server is a separate machine, and the customer does not allow us to run anything on it.

CodePudding user response:

Quoting datafansbj's reply on the 2nd floor:

How are you checking the files? If the check means downloading every file, it will waste a lot of time, because most of the time goes into the download itself.
The general idea is as follows:
1. The files on the FTP server have their own hash checksums that you can query.
2. The records in the database have the corresponding hashes.
3. Do not download the file; compare the two hash values directly.

The files on the FTP server are chaotic and very mixed. They span half a century, and past operations were not standardized, so the only thing known about each file now is its FTP address; nothing else is available. The check is not an integrity check of the download. It checks whether the file can be opened. For a JPG, for example, I read the image width from the file stream: if there is no error the file is good, otherwise it is damaged. There is no reference data at all, and file sizes range from a few megabytes to several hundred megabytes.
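The "read the width from the stream" check described above could look roughly like the sketch below. It uses the standard `javax.imageio` API; the class and method names (`ImageCheck`, `looksReadable`) are hypothetical, not the poster's code. Note the limitation: reading the width and height only parses the header, so a file whose header is intact but whose pixel data is truncated would still pass. Catching that would need a full decode, which costs far more time and memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

final class ImageCheck {
    /** Returns true if an ImageIO reader recognizes the stream and can read its dimensions. */
    static boolean looksReadable(InputStream in) {
        try (ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            if (iis == null) {
                return false; // no stream provider available
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return false; // no installed reader recognizes the format
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly and ignoreMetadata keep the read cheap
                reader.setInput(iis, true, true);
                return reader.getWidth(0) > 0 && reader.getHeight(0) > 0;
            } finally {
                reader.dispose();
            }
        } catch (IOException | RuntimeException e) {
            return false; // any read error is treated as a damaged file
        }
    }
}
```

Because only the header is needed, the caller can close the FTP stream as soon as this returns instead of transferring the remaining megabytes.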

CodePudding user response:

Download the files concurrently to sync them to a new server, then run the check on the synced copy.

Quoting A_Lonely_Smile's reply on the 3rd floor:

Quote: tianfang's reply on the 1st floor:
Run the check on the FTP file server itself and write the results to a file or a database.

The FTP server is a separate machine, and the customer does not allow us to run anything on it.

Solve it at the system level first; the performance is much higher:
1. Copy the files out onto an external hard drive; or
2. Use rsync to synchronize the whole directory to a new server (a PC with a large disk will do), then run the check on the new server.

If you must access the files remotely: use two thread pools, one for downloading and one for analysis. The download pool can saturate the FTP I/O, while the analysis pool checks the files and submits their status.

Add a few fields to the file records: downloaded, local path, analysis result, and deleted. A download manager loops, picks up a batch of files that have not been downloaded yet, and assigns them to worker threads. The manager then hands the records of downloaded files to the analysis workers, which submit the result after analysis (ideally also computing a hash and storing it) and delete the local file.
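The two-pool design above can be sketched as follows. This is only an illustration under assumptions: `TwoPoolCheck`, `run`, and the `download` and `analyze` callbacks are hypothetical names, downloaded content is passed as a `byte[]` rather than a local file path, and results are collected in a map where a real run would write them back to the database. The semaphore caps how many files are in flight, so the producer blocks instead of queueing ten million tasks in memory, and `downloadThreads` can be kept at or below the FTP connection limit.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.function.Function;
import java.util.function.Predicate;

final class TwoPoolCheck {
    /** Downloads on one pool, analyzes on another, and maps each path to true (good) or false (bad). */
    static Map<String, Boolean> run(List<String> paths,
                                    Function<String, byte[]> download,
                                    Predicate<byte[]> analyze,
                                    int downloadThreads,
                                    int analysisThreads,
                                    int maxInFlight) throws InterruptedException {
        ExecutorService downloadPool = Executors.newFixedThreadPool(downloadThreads);
        ExecutorService analysisPool = Executors.newFixedThreadPool(analysisThreads);
        Semaphore inFlight = new Semaphore(maxInFlight);
        // Assumption: kept in memory for the sketch; a real run would persist each result.
        Map<String, Boolean> results = new ConcurrentHashMap<>();
        try {
            for (String path : paths) {
                inFlight.acquire(); // blocks here once maxInFlight files are pending
                CompletableFuture
                        .supplyAsync(() -> download.apply(path), downloadPool)
                        .thenApplyAsync(analyze::test, analysisPool)
                        .exceptionally(e -> false) // failed download or check counts as bad
                        .thenAccept(ok -> {
                            try {
                                results.put(path, ok);
                            } finally {
                                inFlight.release();
                            }
                        });
            }
            // Taking every permit means all in-flight files have finished.
            inFlight.acquire(maxInFlight);
        } finally {
            downloadPool.shutdown();
            analysisPool.shutdown();
        }
        return results;
    }
}
```

A call such as `TwoPoolCheck.run(paths, p -> fetch(p), bytes -> check(bytes), 4, 8, 32)` would use 4 download threads and 8 analysis threads with at most 32 files pending, where `fetch` and `check` stand in for the real FTP download and the file-type-specific check.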