Finding which files are being read from during a session (Python code)

Time:12-07

I have a large system written in Python. When I run it, it reads all sorts of data from many different files on my filesystem. There are thousands of lines of code and hundreds of files, most of which are not actually used. I want to see which files the system actually accesses (on Ubuntu) and, ideally, where in the code they are opened. Filenames are built dynamically from variables and so on, so they cannot be determined just by reading the code. I have access to the code, of course, and can change it.

I am trying to figure out how to do this efficiently, with minimal changes to the code:

  1. Is there a Linux way to determine which files are accessed, and at what times? This would be useful, although it won't tell me where in the code the access happens.
  2. Is there a simple way to make an "open file" call also log the file name, time, etc. of the opened file? Ideally without having to go into the code and change every open call: there are many of them, and some are never reached at runtime.

Thanks

CodePudding user response:

For 1, you can use:

ls -la /proc/<PID>/fd

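Replace <PID> with your process ID. Note that this lists all open file descriptors, including stdin, stdout, and stderr, and often other things such as sockets (which also use file descriptors); filtering the output down to regular files should be easy. It only shows descriptors that are open at the moment you run the command, so files that were opened and already closed will not appear.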
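The same information can be read from Python, which makes the filtering step easier. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function names list_open_files and regular_files_only are made up for this illustration, and the "/dev/" filter is just one simple way to drop terminals and devices.

```python
import os

def list_open_files(pid="self"):
    """Return {fd: target} for descriptors currently open in a process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink pointing at whatever the descriptor refers to.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir() and readlink()
        result[int(name)] = target
    return result

def regular_files_only(fd_map):
    """Keep entries that look like ordinary paths (assumed filter, adjust as needed)."""
    # Sockets and pipes appear as "socket:[123]" / "pipe:[456]", not as absolute paths.
    return {
        fd: target
        for fd, target in fd_map.items()
        if target.startswith("/") and not target.startswith("/dev/")
    }

if __name__ == "__main__":
    for fd, target in sorted(regular_files_only(list_open_files()).items()):
        print(fd, target)
```

Pass a numeric PID instead of the default "self" to inspect another process you own. Like the ls command, this is a snapshot, so it would have to be polled to catch short-lived opens.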

For 2, see the great solution proposed here: "Override python open function when used with the 'as' keyword to print anything".

That is, override the open function with your own version, which can include the additional logging.

CodePudding user response:

One possible method is to override (shadow) the built-in open function. This can have many side effects depending on the code, so I would do it very carefully, but basically here's an example:

>>> _open = open
>>> def open(filename, *args, **kwargs):
...     print(filename)
...     return _open(filename, *args, **kwargs)
...
>>> open('somefile.txt')
somefile.txt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in open
FileNotFoundError: [Errno 2] No such file or directory: 'somefile.txt'

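As you can see, my new open function calls the original open (saved as _open) and returns its result, but first prints its argument (the filename). This can be made more sophisticated, for example by logging the filename instead of printing it, but the most important thing is that this must run before any use of open in your code. Note that defining open like this only shadows the name in the module (or interactive session) where it is defined; to affect other modules, it has to be assigned to builtins.open.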
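If replacing open feels too invasive, a different technique from the one above is Python's audit hook mechanism (sys.addaudithook, available from Python 3.8), which reports an "open" event with the arguments (path, mode, flags) whenever a file is opened through open, io.open, or os.open. The sketch below is only an illustration under that assumption; opened_paths and _audit are hypothetical names.

```python
import os
import sys

opened_paths = []  # hypothetical collector of (path, mode) pairs

def _audit(event, args):
    # The hook is called for every audit event, so filter for "open" only.
    if event == "open":
        path, mode, flags = args
        # Normalise pathlib.Path and similar objects to plain strings.
        if isinstance(path, os.PathLike):
            path = os.fspath(path)
        opened_paths.append((path, mode))

# Must run before the code you want to observe; hooks cannot be removed afterwards.
sys.addaudithook(_audit)
```

The collected list may also include files the interpreter opens for its own purposes, such as modules being imported, so some filtering by directory is likely to be needed.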
