Run a series of randomizations while taking into account the previous result

Time:10-01

Let's say I have this folder :

Data

In this folder there are a number of files.

Let's say I want to do this :

1: I want to randomly select 25% of Data's files and store them, for example, in a folder named '25'.

2: Then I want to increase the percentage. I want to randomly select 50% of Data's files and store them for example in a folder named '50'.

Now this 50% selected from Data must include the 25% already selected previously in step 1, plus another 25% of new random files.

Here's what I have tried :

import os
import random
import shutil


def getPercentageData(data_path, out_path, percent):
    files = os.listdir(data_path)
    files_to_keep = round(len(files) * percent)

    for file_name in random.sample(files, files_to_keep):
        shutil.copy(os.path.join(data_path, file_name), out_path)

But each call draws an independent random sample, so the 50% selection does not necessarily include the files selected for the 25%.

CodePudding user response:

When sampling the second time, you can sample the same number of files as you did for your first sample, but choose them from a list that excludes the files present in the first sample. Then simply merge the first sample into your second.

This should work (here, the letters represent your file names):

import string
files = list(string.ascii_letters)  # placeholder list representing your file paths

import random
percent = 0.25
files_to_keep = round(len(files) * percent)

first_sample = random.sample(files, files_to_keep)

available_files = [f for f in files if f not in first_sample]
second_sample = first_sample + random.sample(available_files, files_to_keep)

print(first_sample)
# output (in my case):
# ['R', 'd', 'h', 'N', 'H', 'I', 'w', 'y', 'u', 'm', 'D', 'Y', 'r']
print(second_sample)
# output (in my case):
# ['R', 'd', 'h', 'N', 'H', 'I', 'w', 'y', 'u', 'm', 'D', 'Y', 'r', 'T', 'E', 'F', 'i', 'q', 'A', 'C', 's', 'G', 'z', 'b', 'M', 'l']
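
A variation on the same idea, as a minimal sketch: shuffle the file list once and take progressively longer prefixes of it, so every larger sample automatically contains every smaller one. The names `nested_samples` and `seed` are hypothetical and not part of the answer above; the file names are placeholders.

```python
import random


def nested_samples(files, percents, seed=None):
    """Map each percentage to a sample; smaller samples are prefixes of larger ones."""
    rng = random.Random(seed)
    shuffled = sorted(files)  # sort first so a given seed always yields the same order
    rng.shuffle(shuffled)     # one shuffle shared by all percentages
    return {p: shuffled[:round(len(shuffled) * p)] for p in percents}


# Placeholder file names standing in for os.listdir(data_path)
files = [f"{i}.txt" for i in range(1, 21)]
samples = nested_samples(files, [0.25, 0.5], seed=0)

print(samples[0.25])
print(samples[0.5])
```

Each list can then be copied into its own folder with `shutil.copy`, as in the question. Note that `round()` rounds exact halves to the nearest even integer, so the sample size can be one lower than expected when `len(files) * percent` ends in .5.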

CodePudding user response:

Create a class that will hold the state of the last copied files. This way, we can either:

  • Use the same set of files if the percentage didn't change
  • Add new files if the percentage increased
  • Remove some files if the percentage decreased

import os
import random
import shutil


class FileMover:
    def __init__(self):
        self.files = set()

    def getPercentageData(self, src_path, dst_path, percent):
        os.makedirs(dst_path, exist_ok=True)

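        src_files = set(os.listdir(src_path))
        files_to_copy = round(len(src_files) * percent)

        # random.sample() needs a sequence (sets are rejected since Python 3.11),
        # so convert each set to a sorted list before sampling.
        if files_to_copy > len(self.files):
            # Percentage increased: keep the current files and add new ones.
            add_count = files_to_copy - len(self.files)
            add_items = random.sample(sorted(src_files - self.files), add_count)
            self.files.update(add_items)
        elif files_to_copy < len(self.files):
            # Percentage decreased: keep a random subset of the current files.
            new_items = random.sample(sorted(self.files), files_to_copy)
            self.files = set(new_items)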

        for file_name in self.files:
            shutil.copy(os.path.join(src_path, file_name), dst_path)

        print(percent, self.files)

mover = FileMover()

for percent in [
    0.3,
    0.7,
    0.5,
    0.6,
    0.6,
    0.1,
]:
    mover.getPercentageData("./src", f"./dst/{percent}", percent)

For simplicity, let's say we have 10 files in the src directory, so that 0.3 maps to 3 files, 0.7 to 7 files, and so on:

$ tree
.
├── script.py
└── src
    ├── 10.txt
    ├── 1.txt
    ├── 2.txt
    ├── 3.txt
    ├── 4.txt
    ├── 5.txt
    ├── 6.txt
    ├── 7.txt
    ├── 8.txt
    └── 9.txt

1 directory, 11 files

Now let's run the script:

$ python script.py  # I manually sorted the data below for a clearer output
0.3 {'3.txt', '8.txt', '9.txt'}
0.7 {'2.txt', '3.txt', '5.txt', '7.txt', '8.txt', '9.txt', '10.txt'}
0.5 {'2.txt', '3.txt', '5.txt', '7.txt', '8.txt'}
0.6 {'1.txt', '2.txt', '3.txt', '5.txt', '7.txt', '8.txt'}
0.6 {'1.txt', '2.txt', '3.txt', '5.txt', '7.txt', '8.txt'}
0.1 {'5.txt'}

As you can see, it correctly takes the previous set of files into account, whether the percentage increased, decreased, or stayed the same. Now let's check the copied files to verify:

$ tree
.
├── dst
│   ├── 0.1
│   │   └── 5.txt
│   ├── 0.3
│   │   ├── 3.txt
│   │   ├── 8.txt
│   │   └── 9.txt
│   ├── 0.5
│   │   ├── 2.txt
│   │   ├── 3.txt
│   │   ├── 5.txt
│   │   ├── 7.txt
│   │   └── 8.txt
│   ├── 0.6
│   │   ├── 1.txt
│   │   ├── 2.txt
│   │   ├── 3.txt
│   │   ├── 5.txt
│   │   ├── 7.txt
│   │   └── 8.txt
│   └── 0.7
│       ├── 10.txt
│       ├── 2.txt
│       ├── 3.txt
│       ├── 5.txt
│       ├── 7.txt
│       ├── 8.txt
│       └── 9.txt
├── script.py
└── src
    ├── 10.txt
    ├── 1.txt
    ├── 2.txt
    ├── 3.txt
    ├── 4.txt
    ├── 5.txt
    ├── 6.txt
    ├── 7.txt
    ├── 8.txt
    └── 9.txt

7 directories, 33 files
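
The relationship between consecutive runs can also be checked in code rather than by eye. The following is a small sketch that takes the sets printed by the script above and confirms that, for each pair of consecutive runs, one selection contains the other. The helper name `is_consistent` is hypothetical.

```python
def is_consistent(previous, current):
    """True if one selection contains the other (the set grew, shrank, or stayed the same)."""
    return previous <= current or current <= previous


# The selections printed by the script run above, in order.
runs = [
    {'3.txt', '8.txt', '9.txt'},
    {'2.txt', '3.txt', '5.txt', '7.txt', '8.txt', '9.txt', '10.txt'},
    {'2.txt', '3.txt', '5.txt', '7.txt', '8.txt'},
    {'1.txt', '2.txt', '3.txt', '5.txt', '7.txt', '8.txt'},
    {'1.txt', '2.txt', '3.txt', '5.txt', '7.txt', '8.txt'},
    {'5.txt'},
]

# Compare each run with the one that follows it.
checks = [is_consistent(a, b) for a, b in zip(runs, runs[1:])]
print(all(checks))
```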