Delete files with specific extensions listed in a text file

Time:10-06

I am currently trying to clean up some media folders from a webserver. The thing is every single file is duplicated in multiple different resolutions, and not all are the same.

For example, picture1.jpg also has picture1-150x150.jpg, picture1-100x100.jpg, and picture1-50x50.jpg. While many files share the same resolutions, many others come in different ones.

So first I tried this:

import os

dir_name = "path"
test = os.listdir(dir_name)

for item in test:
    if item.endswith("150x150.jpg"):
        os.remove(os.path.join(dir_name, item))

It does its job, but it got quite bloated after adding all kinds of different resolutions and file extensions (jpg, jpeg, png, etc.):

if item.endswith("-150x150.jpg"):
    os.remove(os.path.join(dir_name, item))
if item.endswith("-100x100.jpg"):
    os.remove(os.path.join(dir_name, item))
if item.endswith("-75x75.jpeg"):
    os.remove(os.path.join(dir_name, item))
if item.endswith("-50x50.jpeg"):
    os.remove(os.path.join(dir_name, item))

# etc.

So I tried to type those resolutions into a textfile and use it as a list.

import os

dir_name = "path"
folder = os.listdir(dir_name)

with open('list.txt') as f:
    lines = f.read().splitlines()

for file in folder:
    if file.endswith(str(lines)):
        os.remove(os.path.join(dir_name, file))

While I am able to read and alter code to a certain extent, this is all I managed to do after half a day with Google. Therefore I kindly ask for any help or direction.

CodePudding user response:

The method endswith accepts a tuple as an argument, which means you can combine all your extensions into a single variable.

extensions = ("-150x150.jpg","-100x100.jpg","-75x75.jpeg","-50x50.jpeg")

And then you pass this variable to endswith:

if file.endswith(extensions):
    os.remove(os.path.join(dir_name, file))

Here is the snippet I used for proof of concept:

files = [
    "file1",
    "file2.jpg",
    "file123",
    "file4.jpg.old",
    "file5.txt"
]

extensions = (
    ".jpg",
    ".exe",
    ".txt"
)

for file in files:
    if file.endswith(extensions):
        print(f'File :{file} should be deleted')
    else:
        print(f'Skipping:{file}')

This returned:

╰─ python3 app.py
Skipping:file1
File :file2.jpg should be deleted
Skipping:file123
Skipping:file4.jpg.old
File :file5.txt should be deleted
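
To tie this back to the text file from the question, here is a minimal sketch. It assumes list.txt holds one suffix per line (for example -150x150.jpg). The helper names load_suffixes and delete_matching and the dry_run flag are made up for illustration; they are not part of any library.

```python
import os


def load_suffixes(list_path):
    """Read one suffix per line, skip blank lines, and return a tuple."""
    with open(list_path) as f:
        return tuple(line.strip() for line in f if line.strip())


def delete_matching(dir_name, suffixes, dry_run=True):
    """Return the matching file names; remove them only if dry_run is False."""
    matched = [
        name for name in sorted(os.listdir(dir_name))
        if name.endswith(suffixes)  # endswith accepts a tuple of suffixes
    ]
    if not dry_run:
        for name in matched:
            os.remove(os.path.join(dir_name, name))
    return matched
```

Calling delete_matching("path", load_suffixes("list.txt")) first only lists what would be removed; pass dry_run=False once the list looks right. The original attempt failed because str(lines) turns the whole list into one string, whereas endswith needs a str or a tuple of str.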

CodePudding user response:

I think you need to loop over all the elements of the list lines.

Then, if the file name ends with any element of lines, the file is deleted:

for file in folder:
    for line in lines:
        if file.endswith(line):
            os.remove(os.path.join(dir_name, file))
            break  # already deleted; skip the remaining suffixes

CodePudding user response:

Do any of the file names have dashes in them aside from when it has the "-150x150.jpg"? If not, you could do the following:

import os

dir_name = "path"
folder = os.listdir(dir_name)

for file in folder:
    split_file_name = file.split('-')
    if len(split_file_name) > 1:
        os.remove(os.path.join(dir_name, file))

If you cannot guarantee that there will only be one dash, then I think regex will be your best bet.

import os
import re

dir_name = "path"
folder = os.listdir(dir_name)

pattern = re.compile(r'[a-zA-Z0-9_\-]+-\d+x\d+\.jpg$')

for file in folder:
    if pattern.match(file):
        os.remove(os.path.join(dir_name, file))
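
As a quick way to check which names such a pattern would catch before deleting anything, here is a small sketch. The pattern, the helper is_resized_copy, and the sample names are my own assumptions for illustration; it uses fullmatch so the whole file name has to fit.

```python
import re

# Assumed pattern: any stem, a dash, WIDTHxHEIGHT, then .jpg or .jpeg.
SIZE_SUFFIX = re.compile(r'.+-\d+x\d+\.jpe?g', re.IGNORECASE)


def is_resized_copy(name):
    """True only when the whole file name matches the size-suffix pattern."""
    return SIZE_SUFFIX.fullmatch(name) is not None


samples = [
    "picture1.jpg",              # original, keep
    "my-photo.jpg",              # has a dash but no dimensions, keep
    "my-photo-150x150.jpg",      # resized copy
    "picture1-50x50.jpeg",       # resized copy
    "picture1-150x150.jpg.old",  # different extension, keep
]
results = {name: is_resized_copy(name) for name in samples}
for name, hit in results.items():
    print(f"{name}: {'delete' if hit else 'keep'}")
```

Note that "my-photo.jpg" is kept here, while the split('-') approach would have deleted it.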

CodePudding user response:

First off, if you are working on Linux, the obvious way to solve this problem is to use a shell script:

# cleanup.sh
rm path/*-150x150.jpg
rm path/*-100x100.jpg
rm path/*-75x75.jpeg
rm path/*-50x50.jpeg

Just run this script and you are done.

If you insist on using Python, then this solution is a translation of the bash approach:

import os

dir_name = "path"
to_be_deleted = [
    "*-150x150.jpg",
    "*-100x100.jpg",
    "*-75x75.jpeg",
    "*-50x50.jpeg",
]

for wildcard in to_be_deleted:
    os.system(f"rm {dir_name}/{wildcard}")

Update

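This bash command is even shorter, but note that it is also broader: it removes any .jpg or .jpeg file whose name has a dash followed later by an x: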

rm path/*-*x*.{jpg,jpeg}

Update 2

On Windows, you might not have the rm command, so the Python solution is to use the glob module:

import glob
import os

dir_name = "path"
to_be_deleted = [
    "*-150x150.jpg",
    "*-100x100.jpg",
    "*-75x75.jpeg",
    "*-50x50.jpeg",
]

for wildcard in to_be_deleted:
    for path in glob.glob(f"{dir_name}/{wildcard}"):
        os.remove(path)
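
Python's glob patterns have no {jpg,jpeg} brace expansion, so the short bash version needs one pattern per extension. Below is a rough sketch with pathlib; the function name find_resized and the default extension list are assumptions of mine, and the pattern is only a loose approximation of "dash, digits, x, digits".

```python
from pathlib import Path


def find_resized(dir_name, extensions=("jpg", "jpeg", "png")):
    """Collect files named like <stem>-<digit...>x<digit...>.<ext> in dir_name."""
    root = Path(dir_name)
    found = []
    for ext in extensions:
        # [0-9] forces a digit right after the dash and right after the x;
        # the * parts still match anything, so this is looser than a regex.
        found.extend(root.glob(f"*-[0-9]*x[0-9]*.{ext}"))
    return sorted(found)
```

Print the result of find_resized("path") first; once it looks right, call unlink() on each returned path to delete the files.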

CodePudding user response:

You could simply make a regex that can handle all of the different possibilities, and use it to filter file names. I wrote this example using the loop you provided. You can change this loop to whatever is necessary to gather file names. The main point of this example is the regex filter.

import re, os

dir_name = "path"
test     = os.listdir(dir_name)
    
fmt      = re.compile(r'([\w\d_]+)-\d{1,4}x\d{1,4}\.(jpg|png|jpeg|gif|bmp|tga)', re.I)
for item in test:
    if fmt.search(item):
        os.remove(os.path.join(dir_name, item))

If you don't understand the regex, here is a breakdown:

([\w\d_]+)

Get consecutive word characters, digits and underscores (ex: 'my_family_pic1'). The + means there should be at least 1 word character, digit or underscore, but keep getting as many as are consecutively present. Complementary to + is *, which will match 0 or more consecutive occurrences of the described data. (Strictly speaking, \w already covers digits and underscores, so \d and _ are redundant here.)

-\d{1,4}x\d{1,4}

Get any combination that falls from '-0x0' to '-9999x9999'. The {1,4} part is just saying there should be from 1 to 4 consecutive characters of the type described before it. In this case that type would be digits.

\.(jpg|png|jpeg|gif|bmp|tga)

This part literally says that we expect a dot followed by 'jpg' OR 'png' OR 'jpeg' OR... We have to escape the dot because an unescaped dot in regex means "any character except a newline" (with the re.S flag, a dot matches newlines as well). By escaping it we are telling regex that we really mean a literal dot. The (group) is used to contain a group of possibilities, segregate logic and/or sequester a piece of data that can be referred to directly via its group index. As used here you could perceive this as a conditional statement. If we removed the grouping, the alternation would split the whole pattern: regex would look for either a complete name ending in '.jpg' (ex: 'some_file-10x10.jpg') or the bare text 'png', 'jpeg', 'gif', etc. anywhere in the file name, which is not what we want.
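
To see the groups in action, here is a small sketch that runs the pattern against a few made-up file names. The helper describe is hypothetical and only exists for this illustration.

```python
import re

fmt = re.compile(r'([\w\d_]+)-\d{1,4}x\d{1,4}\.(jpg|png|jpeg|gif|bmp|tga)', re.I)


def describe(name):
    """Return (stem, extension) from the two groups, or None if no match."""
    m = fmt.search(name)
    return (m.group(1), m.group(2)) if m else None


print(describe("my_family_pic1-150x150.JPG"))  # ('my_family_pic1', 'JPG')
print(describe("picture1.jpg"))                # None: no dimensions in the name
print(describe("my-photo-50x50.png"))          # ('photo', 'png'): group 1 cannot contain a dash
print(describe("pic-12345x10.gif"))            # None: width has more than 4 digits
```

Because search is used, group 1 only captures the part of the name directly before the dimensions, which is fine for deciding whether to delete a file.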


The current regex will only find images that have dimensions in the file name (i.e. -150x150). If you want to delete any and every image you can change the regex to:

fmt = re.compile(r'([^.]+)\.(jpg|png|jpeg|gif|bmp|tga)', re.I)

This will start off by accepting every character that is not (^) a dot, followed by a dot and the extension. It is the equivalent of saying "We don't care about the beginning, just make sure it ends this certain way". However, if there are dots within the actual file name, this expression only matches from the dot-free segment directly before an extension, so it works with search() but would fail with match(). We don't have to escape the first dot here because it is within a [character class], which implies we are referring to it as a literal character. Generally, a [character class] is used to list all of the characters or types that we expect in this position. When you start a [character class] with a not (^), we are telling regex all the characters and types that should NOT be in this position.
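
Here is a short sketch of how dots in the file name affect this pattern. The names loose, strict, loose_hit and strict_hit are mine, and the strict variant is just one possible way to require that the name really ends with the extension.

```python
import re

EXTS = r'(jpg|png|jpeg|gif|bmp|tga)'

# The pattern from above: dot-free text, a dot, then an extension.
loose = re.compile(r'([^.]+)\.' + EXTS, re.I)
# Assumed stricter variant: used with fullmatch, the name must end in the extension.
strict = re.compile(r'.+\.' + EXTS, re.I)


def loose_hit(name):
    return loose.search(name) is not None


def strict_hit(name):
    return strict.fullmatch(name) is not None


print(loose.match("holiday.2019.jpg"))            # None: match() starts at 'holiday'
print(loose.search("holiday.2019.jpg").group(1))  # 2019
print(loose_hit("notes.jpg.old"))                 # True: '.jpg' appears mid-name
print(strict_hit("notes.jpg.old"))                # False: name does not end in an image extension
print(strict_hit("holiday.2019.jpg"))             # True
```

If files such as 'notes.jpg.old' must survive, the stricter check is the safer one for a delete script.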

If you decide to use regex and want more information, you can find it here. My little primer above is probably about 15% of all there is to know about it. If you understand this much, learning the rest is mostly trivial.

ASIDE:

If I had to do what you are doing I would not delete any of the images. I would move them all to a directory with a log that has a recording of where each one came from. Then you could make sure none of it is something you actually want to keep. Once you've audited all of the images and are satisfied that the entire directory is junk, you can manually delete the entire directory. In other words, your way assumes that not a single image is actually a part of an interface, and referenced according to the device that is viewing it.
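
A rough sketch of that move-and-log idea, under my own assumptions: the function name quarantine, the CSV log format, and the file name moved.csv are invented for illustration, and files whose names clash with something already in the quarantine directory are simply left where they are.

```python
import csv
import os
import shutil


def quarantine(paths, quarantine_dir, log_name="moved.csv"):
    """Move each path into quarantine_dir and log (original, new) locations."""
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    with open(os.path.join(quarantine_dir, log_name), "a", newline="") as log:
        writer = csv.writer(log)
        for src in paths:
            src = os.path.abspath(src)
            dest = os.path.join(quarantine_dir, os.path.basename(src))
            if os.path.exists(dest):
                continue  # name clash: leave the file in place for manual review
            shutil.move(src, dest)
            writer.writerow([src, dest])  # where it came from, where it went
            moved.append(dest)
    return moved
```

Feed it the full paths that any of the filters above would have deleted. The log makes it possible to move a file back to its original location if it turns out to be needed.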
