Let's say I have some music files organized (poorly) by artist name, for example:
/data/myfolder/Jay Z/some_file1.mp3
/data/myfolder/Jay-Z/some_file2.mp3
/data/myfolder/JayZ/some_file3.mp3
/data/myfolder/Destiny's Child/some_file4.mp3
/data/myfolder/Destinys Child/some_file5.mp3
I want to run some batch operations using regex matching. However, I want to ignore the special characters within the artists' names when finding my matches. I could programmatically replace special characters with Python, but I'm wondering if it's possible to do it completely with the regex pattern.
For example, the following code would only work on some_file1.mp3 and some_file4.mp3 as it is currently written:
import os
import re
artists = ["Jay Z", "Destiny's Child"]
root = "/data/myfolder/"
for filepath in os.listdir(root):
    for artist in artists:
        pattern = r"\/data\/myfolder\/{}\/.*.mp3".format(artist)
        match = re.search(pattern, filepath)
        if match:
            ...do some stuff...
Is there some way to modify my regex pattern r"\/data\/myfolder\/{}\/.*.mp3".format(artist) so that it would successfully match even when there is a dash, single quote, or other specified special character within the string? Basically, I'm trying to ignore the presence of certain characters anywhere in a string when looking for a match.
CodePudding user response:
First things first: your for filepath in os.listdir(root) loop returns the names of the subfolders inside root, but not the files in them. You need to use os.walk:
for dirpath, dirnames, filenames in os.walk(root):
    if not dirnames:
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
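To make the difference between the two calls concrete, here is a small sketch that builds a throwaway directory tree (the temporary folder and the two sample files are assumptions made for illustration, not the real /data/myfolder) and compares what os.listdir and os.walk return:

```python
import os
import tempfile

# Build a throwaway tree that mimics the layout from the question.
with tempfile.TemporaryDirectory() as root:
    for artist, name in [("Jay Z", "some_file1.mp3"), ("Jay-Z", "some_file2.mp3")]:
        os.makedirs(os.path.join(root, artist))
        open(os.path.join(root, artist, name), "w").close()

    # os.listdir yields only the immediate entries: the artist folders.
    top_level = sorted(os.listdir(root))

    # os.walk descends into each folder and yields the files inside it.
    found = sorted(
        os.path.relpath(os.path.join(dirpath, filename), root)
        for dirpath, _dirnames, filenames in os.walk(root)
        for filename in filenames
    )

print(top_level)  # folder names only
print(found)      # relative paths of the .mp3 files
```

The first list contains only folder names, so a pattern that expects a full path ending in .mp3 can never match it.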
Now, if you want to use a regex that ignores any chars of your choice inside some fixed string used as part of a regex, you can try the fuzzy matching capabilities of the PyPI regex library. The idea is to remove all the ignored chars from the artists items, and then allow any number of insertions of these characters in the artist subfolder part.
See the Python code:
import regex, os
artists = ["Jay Z", "Destiny's Child"]
# Strip the ignored chars from each artist name before building the pattern
artists = [regex.sub(r"[',. -]+", "", s) for s in artists]
root = r'/data/myfolder'
for dirpath, dirnames, filenames in os.walk(root):
    if not dirnames:  # only look at folders that have no subfolders
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            for artist in artists:
                # {{...}} is a literal {...} after .format(): the fuzzy constraint
                pattern = r"{}[\\/](?:{}){{i:[',. -]}}[\\/][^\\/]*\.mp3$".format(regex.escape(root), artist)
                match = regex.search(pattern, filepath)
                if match:
                    print(match.group())
Note that [\\/] is used to match both Windows and Linux folder separators. I also added a space to the list of ignored chars.
The artists = [regex.sub(r"[',. -]+", "", s) for s in artists] line is the prep step that removes the ignored chars from the artists names.
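As a quick check of what the prep step produces, here is a minimal sketch using the standard-library re module. It assumes the quantifier after the character class is meant to be +, and the helper name strip_ignored is made up for this example:

```python
import re

# Same ignored characters as in the answer: apostrophe, comma, dot, space, hyphen.
IGNORED = r"[',. -]+"


def strip_ignored(name):
    """Return the name with every ignored character removed (hypothetical helper)."""
    return re.sub(IGNORED, "", name)


artists = ["Jay Z", "Destiny's Child"]
normalized = [strip_ignored(s) for s in artists]
print(normalized)  # ['JayZ', 'DestinysChild']

# The differently spelled folder names collapse to the same strings.
print(strip_ignored("Jay-Z"), strip_ignored("Destinys Child"))
```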
The regex looks like /data/myfolder[\\/](?:DestinysChild){i:[',. -]}[\\/][^\\/]*\.mp3$:
/data/myfolder - a literal root part
[\\/] - a / or \ char
(?:DestinysChild){i:[',. -]} - the DestinysChild string with any number of space, apostrophe, hyphen, dot or comma insertions
[\\/] - a / or \ char
[^\\/]* - zero or more chars other than / and \
\.mp3$ - .mp3 at the end of the string.
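If installing the third-party regex package is not an option, a similar effect can probably be had with the built-in re module by placing an optional run of ignored characters between every pair of characters of the stripped artist name. This is a different technique from the fuzzy matching described above, not the answer's method. The helper names artist_pattern and build_path_pattern are invented for this sketch, and it only tolerates ignored characters between characters of the name, not before or after it:

```python
import re

IGNORED_CLASS = r"[',. -]"  # characters to ignore inside the artist name


def artist_pattern(artist):
    """Build a sub-pattern that tolerates ignored chars between the name's characters."""
    core = re.sub(IGNORED_CLASS + "+", "", artist)  # e.g. "Jay Z" -> "JayZ"
    gap = IGNORED_CLASS + "*"                       # zero or more ignored chars
    return gap.join(re.escape(ch) for ch in core)


def build_path_pattern(root, artist):
    """Compile a pattern for <root>/<artist folder>/<file>.mp3 (hypothetical helper)."""
    return re.compile(
        r"{}[\\/]{}[\\/][^\\/]*\.mp3$".format(re.escape(root), artist_pattern(artist))
    )


root = "/data/myfolder"
artists = ["Jay Z", "Destiny's Child"]
paths = [
    "/data/myfolder/Jay Z/some_file1.mp3",
    "/data/myfolder/Jay-Z/some_file2.mp3",
    "/data/myfolder/JayZ/some_file3.mp3",
    "/data/myfolder/Destiny's Child/some_file4.mp3",
    "/data/myfolder/Destinys Child/some_file5.mp3",
]

patterns = [build_path_pattern(root, a) for a in artists]
matched = [p for p in paths if any(pat.search(p) for pat in patterns)]
print(len(matched))  # 5
```

The generated pattern grows with the length of the name, which is the price for staying within the standard library. Matching is still case-sensitive here.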
CodePudding user response:
Try it like this:
pattern = re.compile(r"/data/myfolder/.*[^/]/.*\.mp3")
CodePudding user response:
Put it inside brackets, [{}]+:
pattern = r"\/data\/myfolder\/[{}]+\/.*.mp3".format(artist)
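Before relying on the bracket form, it may help to see what such a character class accepts. The sketch below assumes the class is followed by a + quantifier and tests only the folder-name part. A name containing ], ^ or - would also need escaping before being placed inside the brackets:

```python
import re

artist = "Jay Z"
# "[Jay Z]+" matches one or more characters drawn from the set {J, a, y, space, Z}.
folder = re.compile(r"[{}]+".format(artist))

checks = {
    name: bool(folder.fullmatch(name))
    for name in ["Jay Z", "JayZ", "Jay-Z", "yaJ"]
}
print(checks)
# {'Jay Z': True, 'JayZ': True, 'Jay-Z': False, 'yaJ': True}
```

The class treats the name as an unordered set of allowed characters, so it accepts reorderings such as yaJ and still rejects Jay-Z, because the hyphen is not in the set.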