NLP and Pandas data extraction

Time:01-09

Findings | Impression | File_name_Location
Lung bases: No pulmonary nodules or evidence of pneumonia | No findings on the current CT to account for the patient's clinical complaint of abdominal pain. | /home/text_file/p123456.txt

I have a pandas DataFrame with 3 columns (from chest X-ray reports): "findings", "impression" and "file_Name", the last of which includes the directory information. I also have separate directories (folders) of chest X-ray images that I have to crawl through to find the image file matching each "file_Name" (the image directories contain more files than my text DataFrame has rows). The matching image path should go in the same row of the DataFrame above, and the image file name must correspond to the text file name.

I need code to solve this.

An example of an image file path is:

          /home/files/f1/images/i123456.jpg

There are folders f1 to f25, each containing hundreds of .jpg files.

Update: Corralien's code raised an exception:

---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    File ~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
       3802 try:
    -> 3803     return self._engine.get_loc(casted_key)
       3804 except KeyError as err:
    
    File ~/miniconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
    
    File ~/miniconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
    
    File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
    
    File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
    
    KeyError: 'File_name_Location'
    
    The above exception was the direct cause of the following exception:
    
    KeyError                                  Traceback (most recent call last)
    Cell In[79], line 9
          6     file = f"{img.stem[1:]}.txt"
          7     images[file] = str(img)
    ----> 9 df['Image_name_Location']=df['File_name_Location'].str.split('/').str[-1].map(images)
    
    File ~/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
       3803 if self.columns.nlevels > 1:
       3804     return self._getitem_multilevel(key)
    -> 3805 indexer = self.columns.get_loc(key)
       3806 if is_integer(indexer):
       3807     indexer = [indexer]
    
    File ~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key, method, tolerance)
       3803     return self._engine.get_loc(casted_key)
       3804 except KeyError as err:
    -> 3805     raise KeyError(key) from err
       3806 except TypeError:
       3807     # If we have a listlike key, _check_indexing_error will raise
       3808     #  InvalidIndexError. Otherwise we fall through and re-raise
       3809     #  the TypeError.
       3810     self._check_indexing_error(key)
    
       KeyError: 'File_name_Location'

CodePudding user response:

IIUC, there is a relation between text and image files: p123456.txt -> f??/images/i123456.jpg.

You can use the following code:

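import pathlib

# create an index of your images with the above relation:
# key = text file name (pNNNNNN.txt), value = full image path
images = {}
for img in pathlib.Path('/home/files').glob('f*/images/*.jpg'):
    file = f"p{img.stem[1:]}.txt"
    images[file] = str(img)

# look up each row's text file name in the index (NaN when no image matches)
df['Image_name_Location'] = df['File_name_Location'].str.split('/').str[-1].map(images)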

Output:

>>> df
            File_name_Location                 Image_name_Location
0  /home/text_file/p123456.txt   /home/files/f1/images/i123456.jpg
1   home/text_file/p987654.txt  /home/files/f22/images/i987654.jpg
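
The KeyError in the question's update means the DataFrame has no column named exactly 'File_name_Location' (the question itself calls the column "file_Name"), so it may help to resolve the real column name before mapping. Below is a minimal sketch of that idea using only the standard library. The helper names resolve_column, build_image_index and match_image are hypothetical, as are the demo files. It assumes the p123456.txt -> i123456.jpg relation described above holds for every file.

```python
import pathlib
import tempfile


def resolve_column(columns, wanted):
    """Return the entry of `columns` that equals `wanted`,
    ignoring case and surrounding whitespace."""
    key = wanted.strip().lower()
    for col in columns:
        if str(col).strip().lower() == key:
            return col
    raise KeyError(f"{wanted!r} not found; available columns: {list(columns)}")


def build_image_index(image_root):
    """Map each expected text file name (pNNN.txt) to its image path (.../iNNN.jpg)."""
    index = {}
    for img in pathlib.Path(image_root).glob('f*/images/*.jpg'):
        # drop the leading 'i' of the image stem and rebuild the text file name
        index[f"p{img.stem[1:]}.txt"] = str(img)
    return index


def match_image(text_path, index):
    """Return the image path for a text file path, or None when no image matches."""
    return index.get(pathlib.PurePosixPath(text_path).name)


# Demo on a throwaway directory tree; the files below are made-up test data.
with tempfile.TemporaryDirectory() as root:
    img_dir = pathlib.Path(root, 'f1', 'images')
    img_dir.mkdir(parents=True)
    (img_dir / 'i123456.jpg').touch()
    (img_dir / 'i555555.jpg').touch()  # an image with no matching report

    index = build_image_index(root)
    matched = match_image('/home/text_file/p123456.txt', index)
    missing = match_image('/home/text_file/p000000.txt', index)

# A column list with inconsistent case and a trailing space, as an illustration.
column = resolve_column(['findings', 'impression', 'file_Name '], 'File_Name')
print(pathlib.Path(matched).name, missing, repr(column))
```

With a real DataFrame, resolve_column(df.columns, 'file_Name') would return the name to use in df[...], and df[col].map(lambda p: match_image(p, index)) would fill the new column, leaving None in rows whose text file has no image. Printing df.columns.tolist() first is the quickest way to see what the column is actually called.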
