Let's say I have this folder:
Data
In this folder there are a number of files.
Let's say I want to do this:
1: I want to randomly select 25% of Data's files and store them, for example, in a folder named '25'.
2: Then I want to increase the percentage. I want to randomly select 50% of Data's files and store them, for example, in a folder named '50'.
Now this 50% selected from Data must include the 25% already selected previously in step 1, plus another 25% of new random ones.
Here's what I have tried:
import os
import random
import shutil

def getPercentageData(data_path, out_path, percent):
    files = os.listdir(data_path)
    files_to_keep = round(len(files) * percent)
    for file_name in random.sample(files, files_to_keep):
        shutil.copy(os.path.join(data_path, file_name), out_path)
But each call draws an independent sample, so the 50% selection does not necessarily include the files already selected for the 25% one.
CodePudding user response:
When sampling the second time, you can sample the same number of files as you did for your first sample, but choose them from a list that excludes the files present in the first sample. Then simply merge the first sample into your second.
This should work (here, the letters represent your file names):
import random
import string

files = list(string.ascii_letters)  # placeholder list representing your file paths
percent = 0.25
files_to_keep = round(len(files) * percent)
first_sample = random.sample(files, files_to_keep)
# Exclude the files already chosen so the second draw adds only new ones
available_files = [f for f in files if f not in first_sample]
second_sample = first_sample + random.sample(available_files, files_to_keep)
print(first_sample)
# output (in my case):
# ['R', 'd', 'h', 'N', 'H', 'I', 'w', 'y', 'u', 'm', 'D', 'Y', 'r']
print(second_sample)
# output (in my case):
# ['R', 'd', 'h', 'N', 'H', 'I', 'w', 'y', 'u', 'm', 'D', 'Y', 'r', 'T', 'E', 'F', 'i', 'q', 'A', 'C', 's', 'G', 'z', 'b', 'M', 'l']
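If you need more than two percentages, a different technique from the one above may be simpler: shuffle the list once and take prefixes of increasing length. Every larger selection then contains every smaller one by construction. This is only a sketch; the function name `nested_samples`, the `seed` parameter, and the file names are made up for illustration.

```python
import random

def nested_samples(items, percents, seed=None):
    """Map each percentage to a sample; larger samples contain the smaller ones."""
    rng = random.Random(seed)   # a fixed seed makes the selection repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)       # shuffle once, then only take prefixes
    return {p: shuffled[:round(len(shuffled) * p)] for p in percents}

files = [f"{i}.txt" for i in range(1, 21)]  # hypothetical file names
samples = nested_samples(files, [0.25, 0.5], seed=0)
print(len(samples[0.25]), len(samples[0.5]))       # 5 10
print(set(samples[0.25]) <= set(samples[0.5]))     # True
```

Each list in the result can then be copied into its own folder with `shutil.copy`, as in the question.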
CodePudding user response:
Create a class that will hold the state of the last copied files. This way, we can either:
- Use the same set of files if the percentage didn't change
- Add new files if the percentage increased
- Remove some files if the percentage decreased
import os
import random
import shutil

class FileMover:
    def __init__(self):
        self.files = set()  # files selected by the previous call

    def getPercentageData(self, src_path, dst_path, percent):
        os.makedirs(dst_path, exist_ok=True)
        src_files = set(os.listdir(src_path))
        files_to_copy = round(len(src_files) * percent)
        if files_to_copy > len(self.files):
            # Percentage increased: keep the current files and add new random ones
            add_count = files_to_copy - len(self.files)
            # random.sample() needs a sequence (sets raise TypeError since Python 3.11)
            add_items = random.sample(sorted(src_files - self.files), add_count)
            self.files.update(add_items)
        elif files_to_copy < len(self.files):
            # Percentage decreased: keep a random subset of the current files
            new_items = random.sample(sorted(self.files), files_to_copy)
            self.files = set(new_items)
        for file_name in self.files:
            shutil.copy(os.path.join(src_path, file_name), dst_path)
        print(percent, self.files)

mover = FileMover()
for percent in [0.3, 0.7, 0.5, 0.6, 0.6, 0.1]:
    mover.getPercentageData("./src", f"./dst/{percent}", percent)
For simplicity, let's say we have 10 files in the src directory, so that 0.3 maps to 3 files, 0.7 to 7 files, and so on:
$ tree
.
├── script.py
└── src
    ├── 10.txt
    ├── 1.txt
    ├── 2.txt
    ├── 3.txt
    ├── 4.txt
    ├── 5.txt
    ├── 6.txt
    ├── 7.txt
    ├── 8.txt
    └── 9.txt
1 directory, 11 files
Now let's run the script:
$ python script.py # I manually sorted the data below for a clearer output
0.3 {'3.txt', '8.txt', '9.txt'}
0.7 {'2.txt', '3.txt', '5.txt', '7.txt', '8.txt', '9.txt', '10.txt'}
0.5 {'2.txt', '3.txt', '5.txt', '7.txt', '8.txt'}
0.6 {'1.txt', '2.txt', '3.txt', '5.txt', '7.txt', '8.txt'}
0.6 {'1.txt', '2.txt', '3.txt', '5.txt', '7.txt', '8.txt'}
0.1 {'5.txt'}
As you can see, it correctly takes the previous set of files into account, whether the percentage increased, decreased, or stayed the same. Now let's check the copied files to verify:
$ tree
.
├── dst
│   ├── 0.1
│   │   └── 5.txt
│   ├── 0.3
│   │   ├── 3.txt
│   │   ├── 8.txt
│   │   └── 9.txt
│   ├── 0.5
│   │   ├── 2.txt
│   │   ├── 3.txt
│   │   ├── 5.txt
│   │   ├── 7.txt
│   │   └── 8.txt
│   ├── 0.6
│   │   ├── 1.txt
│   │   ├── 2.txt
│   │   ├── 3.txt
│   │   ├── 5.txt
│   │   ├── 7.txt
│   │   └── 8.txt
│   └── 0.7
│       ├── 10.txt
│       ├── 2.txt
│       ├── 3.txt
│       ├── 5.txt
│       ├── 7.txt
│       ├── 8.txt
│       └── 9.txt
├── script.py
└── src
    ├── 10.txt
    ├── 1.txt
    ├── 2.txt
    ├── 3.txt
    ├── 4.txt
    ├── 5.txt
    ├── 6.txt
    ├── 7.txt
    ├── 8.txt
    └── 9.txt
7 directories, 33 files
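Instead of comparing the tree output by eye, the nesting can also be checked in code. The sketch below is not part of the script above: `is_nested` is a hypothetical helper, and the demo rebuilds the 0.3 and 0.7 folders from the sample run in a temporary directory so that it runs on its own.

```python
import os
import tempfile

def is_nested(smaller_dir, larger_dir):
    """True if every file name in smaller_dir also appears in larger_dir."""
    return set(os.listdir(smaller_dir)) <= set(os.listdir(larger_dir))

# Throwaway copies of dst/0.3 and dst/0.7, using the names from the run above
layout = {
    "0.3": ["3.txt", "8.txt", "9.txt"],
    "0.7": ["2.txt", "3.txt", "5.txt", "7.txt", "8.txt", "9.txt", "10.txt"],
}
with tempfile.TemporaryDirectory() as root:
    for folder, names in layout.items():
        os.makedirs(os.path.join(root, folder))
        for name in names:
            open(os.path.join(root, folder, name), "w").close()  # empty file
    grew = is_nested(os.path.join(root, "0.3"), os.path.join(root, "0.7"))
    shrank = is_nested(os.path.join(root, "0.7"), os.path.join(root, "0.3"))

print(grew, shrank)  # True False
```

Against the real output you would call it as, for example, `is_nested("./dst/0.3", "./dst/0.7")`.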