text data extraction from a string

Time:12-29

I have text data like the example below, and I want to load it into a pandas DataFrame. The columns should be the section names found in the text ('EXAMINATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION'), and each column should hold the text that belongs to that section.

'FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process'

But I'm stuck. The string above is what I get after extracting the data from a text file and cleaning it; from here I need to put it into a pandas DataFrame.

CodePudding user response:

It looks like the input is organized such that EXAMINATION, TECHNIQUE, etc. occur in that order.

One approach is to iterate over consecutive pairs of keywords and use .split() to select the content between them:

import pandas as pd

data = 'FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process'

strings = ('EXAMINATION','TECHNIQUE', 'COMPARISON','FINDINGS', 'IMPRESSION', '')
out = {}

for s1, s2 in zip(strings, strings[1:]):
    if not s2:
        text = data.split(s1)[1]
    else:
        text = data.split(s1)[1].split(s2)[0]
    out[s1] = [text]

print(pd.DataFrame(out))

Which results in:

                                         EXAMINATION                TECHNIQUE COMPARISON                                           FINDINGS                          IMPRESSION
0  : CHEST PA AND LAT INDICATION: F with new onse...  : Chest PA and lateral     : None   : There is no focal consolidation pleural effu...  : No acute cardiopulmonary process
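Note that each value in this output still starts with the ": " that followed the keyword. A minimal variation, assuming every keyword is immediately followed by a colon as in the sample, is to split on the keyword plus the colon and strip the surrounding whitespace. The shortened `data` string and the name `keywords` below are used only for illustration:

```python
# Shortened version of the report text, for illustration only.
data = ('FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset '
        'ascites eval for infection TECHNIQUE: Chest PA and lateral '
        'COMPARISON: None FINDINGS: There is no focal consolidation '
        'IMPRESSION: No acute cardiopulmonary process')

keywords = ('EXAMINATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION')
out = {}

# Pair each keyword with the next one; the last keyword is paired with None.
for s1, s2 in zip(keywords, keywords[1:] + (None,)):
    # Splitting on "KEYWORD:" drops the colon together with the keyword.
    text = data.split(s1 + ':', 1)[1]
    if s2 is not None:
        text = text.split(s2 + ':', 1)[0]
    # strip() removes the spaces left on either side of the section text.
    out[s1] = [text.strip()]
```

The resulting `out` dict has the same shape as before, so it can be passed to `pd.DataFrame(out)` in the same way.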

CodePudding user response:

A solution follows. It relies on these assumptions:

  1. Keywords as presented are located in that order within the sample text.
  2. The keywords are not contained within the text to be extracted.
  3. Each keyword is followed by a ": " (the colon and whitespace are removed).
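If it is unclear whether a given report satisfies these assumptions, they can be checked before extraction. The sketch below is one possible check; the helper name `check_keyword_assumptions` is hypothetical, and the `report` string is a shortened version of the sample used only for illustration:

```python
def check_keyword_assumptions(text, keyword_list):
    """Raise ValueError if the text breaks any of the three assumptions."""
    last_end = 0
    for kw in keyword_list:
        marker = kw + ": "
        # Assumption 3: the keyword is followed by ": ".
        # Searching from last_end also enforces the ordering in assumption 1.
        start = text.find(marker, last_end)
        if start == -1:
            raise ValueError(f"'{marker}' is missing or out of order")
        # Assumption 2: the keyword does not reappear inside the section text.
        if text.count(kw) != 1:
            raise ValueError(f"'{kw}' appears more than once")
        last_end = start + len(marker)


# Shortened version of the sample report, for illustration only.
report = ("FINAL REPORT EXAMINATION: CHEST PA AND LAT TECHNIQUE: Chest PA and lateral "
          "COMPARISON: None FINDINGS: There is no focal consolidation "
          "IMPRESSION: No acute cardiopulmonary process")
keywords = ["EXAMINATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]

# Returns None silently when all assumptions hold.
check_keyword_assumptions(report, keywords)
```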

Solution

import pandas as pd

sample = "FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process"

keywords = ["EXAMINATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]


# Create function to extract text between each of the keywords
def extract_text_using_keywords(clean_text, keyword_list):
    extracted_texts = []
    for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
        prev_kw_index = clean_text.index(prev_kw)
        current_kw_index = clean_text.index(current_kw)
        extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
        # Extract the text after the final keyword in keyword_list (i.e. "IMPRESSION")
        if current_kw == keyword_list[-1]:
            extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
    return extracted_texts


# Extract text
result = extract_text_using_keywords(sample, keywords)
# Create pandas dataframe
df = pd.DataFrame([result], columns=keywords)

print(df)

# To append future results to the end of the pandas df you can use
# df.loc[len(df)] = result

Output

                                         EXAMINATION             TECHNIQUE  COMPARISON                                           FINDINGS                        IMPRESSION
0  CHEST PA AND LAT INDICATION: F with new onset ...  Chest PA and lateral        None  There is no focal consolidation pleural effusi...  No acute cardiopulmonary process
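When several reports need to be loaded, one alternative to appending row by row is to collect one dict per report and build the DataFrame once. The sketch below swaps the index arithmetic for `re.split` with a capture group, which is a different technique from the function above. The helper name `report_to_row` is hypothetical, the first report is a shortened version of the sample, and the second report is an invented variant with missing sections, included only to show how gaps appear:

```python
import re

import pandas as pd

keywords = ["EXAMINATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]


def report_to_row(text, keyword_list):
    """Return {keyword: section text} for a single report string."""
    # Split on "KEYWORD:" plus any following whitespace; the capture group
    # makes re.split keep each matched keyword in the result list.
    pattern = "(" + "|".join(re.escape(kw) for kw in keyword_list) + r"):\s*"
    parts = re.split(pattern, text)
    # parts looks like [preamble, kw1, text1, kw2, text2, ...]
    return {kw: body.strip() for kw, body in zip(parts[1::2], parts[2::2])}


reports = [
    # Shortened version of the sample report.
    "FINAL REPORT EXAMINATION: CHEST PA AND LAT TECHNIQUE: Chest PA and lateral "
    "COMPARISON: None FINDINGS: There is no focal consolidation "
    "IMPRESSION: No acute cardiopulmonary process",
    # Hypothetical report with some sections missing.
    "EXAMINATION: CHEST PA AND LAT IMPRESSION: No acute cardiopulmonary process",
]

# Sections absent from a report become NaN in the corresponding column.
df = pd.DataFrame([report_to_row(r, keywords) for r in reports], columns=keywords)
print(df)
```

Unlike the index-based function, this version does not depend on the keywords appearing in a fixed order, but it still assumes that no keyword followed by a colon occurs inside the section text.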