How to extract data from .txt files and put the data into a pandas DataFrame by columns

I am trying to extract the text under each subject heading ('findings', 'impression') and put it into a pandas DataFrame, with one column per subject.

CodePudding user response：

Here is example code with two report texts (text1, text2) and four subjects (indication, comparison, findings, and impression).

import re
import pandas as pd

text1 = '''FINAL REPORT EXAMINATION: CHEST (PORTABLE AP) INDICATION: ___ year old woman with cough neutropenic // r/o infection TECHNIQUE: Single frontal view of the chest COMPARISON: Chest radiograph from ___, ___. FINDINGS: Right subclavian catheter tip terminates in the lower SVC. Cardiac size is normal. The lungs are clear. There is no pneumothorax or pleural effusion. IMPRESSION: No evidence of pneumonia. '''
text2 = '''FINAL REPORT EXAMINATION: CHEST (PORTABLE AP) INDICATION: ___ year old woman with cough neutropenic // r/o infection TECHNIQUE: Single frontal view of the chest COMPARISON: Chest radiograph from ___, ___. FINDINGS: Right subclavian catheter tip terminates in the lower SVC. Cardiac size is normal. The lungs are clear. There is no pneumothorax or pleural effusion. IMPRESSION: No evidence of pneumonia. '''

subjects = ("INDICATION", "COMPARISON", "FINDINGS", "IMPRESSION")

data = [re.split('|'.join(subjects), text)[1:] for text in [text1, text2]]
data = pd.DataFrame(data, columns = subjects)

The resulting data is as follows (each value still starts with the colon that followed its heading):

INDICATION  COMPARISON  FINDINGS    IMPRESSION
0   : ___ year old woman with cough neutropenic //...   : Chest radiograph from ___, ___.   : Right subclavian catheter tip terminates in ...   : No evidence of pneumonia.
1   : ___ year old woman with cough neutropenic //...   : Chest radiograph from ___, ___.   : Right subclavian catheter tip terminates in ...   : No evidence of pneumonia.
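
The values above keep the leading ": ", and the TECHNIQUE text ends up inside the INDICATION column because only the four listed headings are split on. One possible refinement is sketched below. It assumes every heading is an uppercase word followed by a colon, and the names SECTIONS, HEADING_RE and parse_report are hypothetical, not part of the answer above. The sample report is shortened from the text in the question.

```python
import re
import pandas as pd

# Assumed section headings; extend this tuple to match your reports.
SECTIONS = ("EXAMINATION", "INDICATION", "TECHNIQUE",
            "COMPARISON", "FINDINGS", "IMPRESSION")

# The capturing group makes re.split keep the heading names in its result.
HEADING_RE = re.compile(r"\b(" + "|".join(SECTIONS) + r"):")

def parse_report(text):
    """Return {heading: body} for one report (hypothetical helper)."""
    parts = HEADING_RE.split(text)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}

report = ("FINAL REPORT EXAMINATION: CHEST (PORTABLE AP) "
          "INDICATION: cough TECHNIQUE: Single frontal view "
          "FINDINGS: The lungs are clear. IMPRESSION: No evidence of pneumonia.")

# Keep only the columns of interest; a heading missing from a report becomes NaN.
df = pd.DataFrame([parse_report(report)], columns=["FINDINGS", "IMPRESSION"])
print(df)
```

Splitting on every known heading keeps each section's text in its own key, so the columns argument can then select just the subjects you need.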

CodePudding user response：

To extract data from a .txt file and put it into a Pandas dataframe, you can use the following steps:

Import the Pandas library:

import pandas as pd

Open the .txt file and read its contents into a string:

with open('file.txt', 'r') as f:
    data = f.read()

Split the string into a list of lines:

lines = data.split('\n')

Create an empty dictionary to store the data:

data_dict = {}

Iterate over the list of lines and extract the data for each column. Each line is split at the first colon only, so colons inside the body text are kept, and the heading is compared case-insensitively:

wanted = ('Indication', 'Technique', 'Comparison', 'Findings', 'Impression')

for line in lines:
    # 'FINDINGS: some text' -> heading='FINDINGS', body=' some text'
    heading, sep, body = line.partition(':')
    column = heading.strip().capitalize()
    if sep and column in wanted:
        data_dict[column] = body.strip()

Create a Pandas dataframe from the dictionary:

df = pd.DataFrame.from_dict(data_dict, orient='index').transpose()

Display the dataframe:

print(df)

This will extract the data from the .txt file and create a one-row Pandas dataframe with a column for each of the headings 'Indication', 'Technique', 'Comparison', 'Findings', and 'Impression' that appears in the file. It assumes each heading starts its own line; if a whole report sits on a single line, split the text on the headings instead. You can adjust the code to suit your specific needs and data structure.
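
The question mentions several .txt files, so here is a minimal sketch of how the same idea might be extended to a folder of reports, with one row per file. The helper names extract_sections and load_reports, the "file" column, and the folder layout are all assumptions for illustration. The regex treats any run of two or more capital letters followed by a colon as the next heading, so an abbreviation such as "SVC:" inside the body would cut a section short.

```python
import re
from pathlib import Path
import pandas as pd

def extract_sections(text, headings=("FINDINGS", "IMPRESSION")):
    """Hypothetical helper: text after each heading, up to the next ALL-CAPS heading."""
    row = {}
    for heading in headings:
        # Lazy match up to the next "WORD:" heading or the end of the text.
        pattern = rf"{heading}:\s*(.*?)(?=\s+[A-Z]{{2,}}:|\Z)"
        m = re.search(pattern, text, flags=re.DOTALL)
        row[heading.lower()] = m.group(1).strip() if m else None
    return row

def load_reports(folder):
    """Build one DataFrame row per .txt file in `folder` (assumed layout)."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        row = {"file": path.name}
        row.update(extract_sections(path.read_text()))
        rows.append(row)
    return pd.DataFrame(rows)
```

Calling load_reports("reports") on a hypothetical folder named reports would return a dataframe with the columns file, findings and impression.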

CodePudding user response：

To extract data from a file or other source and store it in a Pandas dataframe, you can use the pandas.read_* function that corresponds to the format of your data. For example, if your data is in a CSV file, you can use the pandas.read_csv function to read the data into a dataframe.

Once you have read the data into a dataframe, you can use the dataframe's indexing and slicing features to select and extract specific rows or columns of data. For example, you can use the df.loc[] indexer to select rows based on their label, or the df[] indexer to select columns by name.

Here is an example of how you might extract data by subject, assuming the data is already in a CSV file that has a subject column:

import pandas as pd

# Read the data into a dataframe
df = pd.read_csv('data.csv')

# Extract rows with the subject 'findings'
findings = df.loc[df['subject'] == 'findings']

# Extract rows with the subject 'impression'
impression = df.loc[df['subject'] == 'impression']

This code reads the data from the data.csv file into a dataframe called df, and then uses the df.loc[] indexer to select rows where the subject column is equal to "findings" or "impression". It stores the selected rows in separate dataframes called findings and impression.

You can then use the findings and impression dataframes to further manipulate or analyze the data as needed. For example, you can use the df.head() method to view the first few rows of the dataframe, or use the df.describe() method to get summary statistics for the data.
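
If the reports have already been parsed into one column per section, they can be reshaped so that the subject filter described above applies directly. This is a rough sketch: the wide table, its report_id column, and the sample sentences (borrowed from the report text in the question) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical wide table: one row per report, one column per section.
wide = pd.DataFrame({
    "report_id": [1, 2],
    "findings": ["The lungs are clear.", "Cardiac size is normal."],
    "impression": ["No evidence of pneumonia.", "No evidence of pneumonia."],
})

# melt() turns the section columns into (subject, text) pairs, one row each.
long = wide.melt(id_vars="report_id", var_name="subject", value_name="text")

# The same boolean-mask selection as above, now on the reshaped table.
findings = long.loc[long["subject"] == "findings"]
impression = long.loc[long["subject"] == "impression"]
print(findings)
```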