Parse data from a keyword-based materials data file with Python


I have a keyword-based materials data file. I want to parse data from this file and create variables and matrices to work with in a Python script. The material file may have comment lines at the very top starting with the string "**". I simply want to ignore these and parse the data lines that follow a keyword of the form *keyword_1, together with the keyword's comma-delimited parameters of the form param_1=param1.


What is the fastest and easiest way to parse data from this kind of keyword-based text file with Python? Can I use pandas for this, and if so, how?

Below is a sample input material file, alloy_1.nam:

*************************************************
**               ALLOY_1 MATERIAL DATA
*************************************************
*MATERIAL,NAME=ALLOY_1
*ELASTIC,TYPE=ISO
2.08E5,0.3,291.
2.04E5,0.3,422.
1.96E5,0.3,589.
1.85E5,0.3,755.
1.74E5,0.3,922.
1.61E5,0.3,1089.
1.52E5,0.3,1220.
*EXPANSION,TYPE=ISO,ZERO=293.
13.5E-6,291.
13.6E-6,422.
13.9E-6,589.
14.2E-6,755.
14.7E-6,922.
15.5E-6,1089.
16.4E-6,1200.
*DENSITY
7.92E-9
*CONDUCTIVITY
10.,273.
18.,873.
27.,1373.
*SPECIFIC HEAT
450.e6,273.
580.e6,873.
710.e6,1373.

CodePudding user response:

The idea is to build a list of dictionaries, where each element maps a category name (the keyword) to its data in the form of a dataframe. A temporary dictionary is used to store the comma-separated data, and it is appended to the list of dictionaries each time a new category is found.

Use pandas.DataFrame() to create each dataframe.
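
For illustration, here is a minimal sketch of the structure being built. The values are copied from the ELASTIC block of alloy_1.nam, and the 'ELASTIC' key is just an assumed category name:

import pandas as pd

# a dictionary of column lists becomes a dataframe ...
d = {'col1': ['2.08E5', '2.04E5'], 'col2': ['0.3', '0.3'], 'col3': ['291.', '422.']}
df = pd.DataFrame(d)

# ... and each finished category is stored as a {category: dataframe} dictionary
entry = {'ELASTIC': df}
print(entry)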

Below is the code:

import re

import pandas as pd

# replace the path below with the path to your material file, e.g. alloy_1.nam
with open('/Users/rpghosh/scikit_learn_data/test.txt') as f:
    lines = f.readlines()

# list of {category name: dataframe} dictionaries
lst_dfs = []

# temporary dictionary that collects the columns of the current category
d = {}
dfName = ''
PrevdfName = ''

for line in lines:
    line = line.strip()

    # keyword line: exactly one leading '*' (comment lines start with '**')
    if re.match(r'^\*[^*]', line):
        variable = line.lstrip('*').split(',')
        PrevdfName = dfName
        dfName = variable[0]

        # a new category begins, so flush the data collected for the previous one
        if len(d) > 0:
            df = pd.DataFrame(d)
            # append a dictionary which has category and dataframe
            lst_dfs.append({PrevdfName: df})
            d = {}

    # data line: starts with a digit (values may contain '.', 'E', '-', ...)
    elif re.match(r'^[0-9]', line):
        data = line.split(',')

        for i in range(len(data)):
            # customised column name
            colName = 'col' + str(i + 1)

            # if the column name is already present in the
            # dictionary keys then append the element
            # to the existing key's list
            if colName in d:
                d[colName].append(data[i])
            else:
                d[colName] = [data[i]]
    else:
        # comment line ('**...') or blank line: ignore it
        continue

# flush the last category collected when the file ends
df = pd.DataFrame(d)
lst_dfs.append({dfName: df})

To view the output, you will have a list of dataframes, so the code will be:

for idx, df in enumerate(lst_dfs):
    print(f"{idx=}")
    print(df)
    print()

Output (from my own test.txt, which contains different data than the alloy_1.nam sample above):

idx=0
{'elastic':   col1 col2 col3
0   21   22   23
1   11   12   13
2   31   32   33}

idx=1
{'expansion':   col1 col2 col3
0    4    5    6
1   41   15   16
2   42   25   26}

idx=2
{'density':     col1
0  12343}

idx=3
{'conductivity':   col1 col2 col3 col4 col5 col6
0   54   55   56   51   55   56
1   42   55   56   51   55   56
2   54   55   56   51   55   56
3   42   55   56   51   55   56}
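
If you want to work with the parsed values as numbers rather than strings, a single category's dataframe can be looked up in the list and converted to floats. A minimal sketch, where the 'ELASTIC' name assumes the keywords from alloy_1.nam:

# find the dataframe for one category and convert its string values to floats
for entry in lst_dfs:
    if 'ELASTIC' in entry:
        elastic = entry['ELASTIC'].astype(float)
        print(elastic)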