ffmpeg in Python - extracting meta data

I use ffmpeg for Python to extract meta data from video files. I think the official documentation is available here: https://kkroening.github.io/ffmpeg-python/

To extract meta data (duration, resolution, frames per second, etc.) I use the function "ffmpeg.probe" as provided. Sadly, when running it on a large amount of video files, it is rather inefficient as it seems to (obviously?) load the whole file into memory each time to read just a small amount of data.

If this is not what it does, maybe someone could explain what the cause might be for the rather extensive runtime.

Otherwise, is there any way to retrieve meta data in a more efficient way using ffmpeg or some other library?

Any feedback or help is very much appreciated.

CodePudding user response：

I suggest using ffprobe directly (https://ffmpeg.org/ffprobe.html). Unfortunately, ffmpeg can be CPU-intensive at times, but it all depends on your hardware specs.

CodePudding user response：

"running it on a large amount of video files"

This is where multithreading/multiprocessing could potentially be helpful IF the slowdown comes from subprocess spawning (and not from actual I/O). It may not help if the run is I/O-bound, since file I/O is generally slow compared to virtually everything else.
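A minimal sketch of the threaded approach, assuming ffprobe is on the PATH. The names probe_one and probe_many are made up for illustration and are not part of ffmpeg-python. Threads are enough here because each worker spends its time waiting on an external ffprobe process rather than running Python code.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def probe_one(path):
    """Run ffprobe on one file and return its parsed JSON output."""
    cmd = [
        "ffprobe", "-v", "error",
        "-show_format", "-show_streams",
        "-of", "json", path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def probe_many(paths, probe=probe_one, max_workers=8):
    """Probe several files concurrently.

    Returns a dict mapping each path to its metadata, or to the
    exception raised while probing that file.
    """
    paths = list(paths)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the probes overlap in time.
        futures = {path: pool.submit(probe, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                results[path] = exc
    return results
```

Whether this helps at all depends on where the time goes, so it is worth timing a batch with max_workers=1 against a larger value before committing to it.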

"load the whole file into memory each time to read just a small amount of data"

This is an incorrect assertion, IMO. It should only read the relevant headers/packets to retrieve the metadata. You are likely paying the subprocess tax more than anything else.

"a way to retrieve meta data"

(1) Adding to what @Peter Hassaballeh said above, ffprobe has options to limit what it looks up. If you only need the container (format)-level info, or only the info for a particular stream, you can specify exactly what you need (to an extent). This could save some time.

(2) You can try MediaInfo (another free tool like ffprobe) which you should be able to call from Python as well.

(3) If you are dealing with a particular file format, the fastest way is to decode it yourself in Python and read only the bytes that matter to you. Depending on what the current bottleneck is, it may not be that drastic of an improvement, though.
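As a rough illustration of option (1), here is a sketch that asks ffprobe only for the width, height, and frame rate of the first video stream plus the container duration. The helper names build_ffprobe_cmd, parse_ffprobe_json, and probe_video are made up for this example. It assumes ffprobe is on the PATH and that the file has at least one video stream.

```python
import json
import subprocess
from fractions import Fraction


def build_ffprobe_cmd(path):
    """Build an ffprobe command that requests only a handful of fields."""
    return [
        "ffprobe",
        "-v", "error",                 # suppress the banner and warnings
        "-select_streams", "v:0",      # first video stream only
        "-show_entries", "stream=width,height,r_frame_rate:format=duration",
        "-of", "json",
        path,
    ]


def parse_ffprobe_json(text):
    """Turn ffprobe's JSON output into a small dict of numbers."""
    data = json.loads(text)
    stream = data["streams"][0]  # IndexError if there is no video stream
    return {
        "width": stream["width"],
        "height": stream["height"],
        # r_frame_rate is a fraction string such as "30000/1001";
        # a "0/0" value would raise ZeroDivisionError here.
        "fps": float(Fraction(stream["r_frame_rate"])),
        "duration": float(data["format"]["duration"]),
    }


def probe_video(path):
    """Run the restricted ffprobe query on one file."""
    result = subprocess.run(
        build_ffprobe_cmd(path), capture_output=True, text=True, check=True
    )
    return parse_ffprobe_json(result.stdout)
```

Calling probe_video("some_file.mp4") (a placeholder path) would then return a dict with the four values. How much time this saves compared to a full probe depends on the container and on where the bottleneck actually is.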