I'm trying to recursively move files from an SFTP server to S3, possibly using boto3. I want to preserve the folder/file structure as well. I was looking to do it this way:
import pysftp
private_key = "/mnt/results/sftpkey"
srv = pysftp.Connection(host="server.com", username="user1", private_key=private_key)
srv.get_r("/mnt/folder", "./output_folder")
Then take those files and upload them to S3 using boto3. However, the folders on the server are numerous and deeply nested, and the files are large, so my machine ends up running out of memory and disk space. I was thinking of a script that downloads a single file, uploads it, deletes it, and repeats.
I know this would take a long time to finish, but I could run it as a job without running out of space and without keeping my machine open the entire time. Has anyone done something similar? Any help is appreciated!
CodePudding user response:
If you can't (or don't want to) download all of the files at once before sending them to S3, then you need to download them one at a time.
From there, it follows that you'll need to build a list of files to download, then work through it, transferring each file to your local machine and then sending it to S3.
A very simple version of this would look something like this:
import pysftp
import stat
import boto3
import os
import json

# S3 bucket and prefix to upload to
target_bucket = "example-bucket"
target_prefix = ""

# Root FTP folder to sync
base_path = "./"

# Both base_path and target_prefix should end in a "/"
# Or, for the prefix, be empty for the root of the bucket

srv = pysftp.Connection(
    host="server.com",
    username="user1",
    private_key="/mnt/results/sftpkey",
)

if os.path.isfile("all_files.json"):
    # No need to cache files more than once. This lets us restart
    # on a failure, though really we should be caching files in
    # something more robust than just a json file
    with open("all_files.json") as f:
        all_files = json.load(f)
else:
    # No local cache, go ahead and get the files
    print("Need to get list of files...")
    todo = [(base_path, target_prefix)]
    all_files = []
    while len(todo):
        cur_dir, cur_prefix = todo.pop(0)
        print("Listing " + cur_dir)
        for cur in srv.listdir_attr(cur_dir):
            if stat.S_ISDIR(cur.st_mode):
                # A directory, so walk into it
                todo.append((cur_dir + cur.filename + "/", cur_prefix + cur.filename + "/"))
            else:
                # A file, just add it to our cache
                all_files.append([cur_dir + cur.filename, cur_prefix + cur.filename])
    # Save the cache out to disk
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

# And now, for every file in the cache, download it
# and turn around and upload it to S3
s3 = boto3.client('s3')
while len(all_files):
    ftp_file, s3_name = all_files.pop(0)
    print("Downloading " + ftp_file)
    srv.get(ftp_file, "_temp_")
    print("Uploading " + s3_name)
    s3.upload_file("_temp_", target_bucket, s3_name)
    # Clean up, and update the cache with one less file
    os.unlink("_temp_")
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

srv.close()
Error checking and speed improvements are obviously possible.
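For instance, a minimal sketch of one such improvement would be to retry each transfer a few times before giving up (the helper name, retry count, and backoff here are my own choices, not part of the code above):

import time

def transfer_with_retries(srv, s3, ftp_file, bucket, key, attempts=3):
    # Download one file over SFTP and upload it to S3, retrying on failure
    for attempt in range(1, attempts + 1):
        try:
            srv.get(ftp_file, "_temp_")
            s3.upload_file("_temp_", bucket, key)
            return
        except Exception as exc:
            print("Attempt %d failed for %s: %s" % (attempt, ftp_file, exc))
            time.sleep(5 * attempt)  # crude backoff before the next try
    raise RuntimeError("Giving up on " + ftp_file)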
CodePudding user response:
You have to do it file-by-file.
Start with the recursive download code here:
Python pysftp get_r from Linux works fine on Linux but not on Windows
After each sftp.get, do the S3 upload and remove the local file, as sketched below.
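A minimal sketch of that loop, using pysftp's walktree to visit every remote file instead of the linked get_r code (the bucket name and temp file name are placeholders, and using the remote path as the S3 key is an assumption):

import os
import boto3
import pysftp

s3 = boto3.client("s3")
bucket = "example-bucket"      # placeholder bucket name
local_temp = "_temp_file_"     # scratch file reused for every download

def handle_file(remote_path):
    # Download one file, push it to S3, then delete the local copy
    sftp.get(remote_path, local_temp)
    s3.upload_file(local_temp, bucket, remote_path.lstrip("/"))
    os.remove(local_temp)

with pysftp.Connection(host="server.com", username="user1",
                       private_key="/mnt/results/sftpkey") as sftp:
    # walktree calls handle_file for every regular file under /mnt/folder
    sftp.walktree("/mnt/folder", fcallback=handle_file,
                  dcallback=lambda p: None, ucallback=lambda p: None)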
Actually, you can even copy the file from SFTP to S3 without storing it locally:
Transfer file from SFTP to S3 using Paramiko
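A minimal sketch of that streaming approach with Paramiko and boto3 (the remote path and bucket name are placeholders; boto3's upload_fileobj accepts any readable file-like object, which Paramiko's SFTP file handle is):

import boto3
import paramiko

s3 = boto3.client("s3")

transport = paramiko.Transport(("server.com", 22))
transport.connect(
    username="user1",
    pkey=paramiko.RSAKey.from_private_key_file("/mnt/results/sftpkey"),
)
sftp = paramiko.SFTPClient.from_transport(transport)

# Stream the remote file straight into S3 without writing it to local disk
with sftp.open("/mnt/folder/some_file.dat", "rb") as remote_file:
    remote_file.prefetch()  # optional: speeds up sequential reads
    s3.upload_fileobj(remote_file, "example-bucket", "folder/some_file.dat")

sftp.close()
transport.close()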