I've been struggling on a problem for a few hours.
I've a data-set which looks like this :
References 425 1451 259
/1/J.H.Gibbons,R.L.Macklin:Phys.Rev.B137,1508(1965) 425 1451 260
/2/P.R.Wrean,C.R.Brune,R.W.Kavanagh:Phys.Rev.C49,1205(1994) 425 1451 261
/3/L.Van der Zwan,K.W.Geiger:Nucl.Phys.A152,481(1970) 425 1451 262
/4/T.Murata:JAERI-Conf 98-003,p.215 425 1451 263
3.500000 6 3.844649-2 3.508375 6 3.783472-2 4.000000 6-2.064883-2 425 3 2 7
4.014688 6-2.289045-2 4.403160 6-8.623264-2 4.500000 6-1.022847-1 425 3 2 8
1.450000 7-2.039133-1 1.500000 7-1.930321-1 425 3 2 17
and I wonder how I could read and have in an array or a txt object only the last rows where they're values in each columns. To be sure I would like to have an object like this :
3.500000 6 3.844649-2 3.508375 6 3.783472-2 4.000000 6-2.064883-2 425 3 2 7
4.014688 6-2.289045-2 4.403160 6-8.623264-2 4.500000 6-1.022847-1 425 3 2 8
1.450000 7-2.039133-1 1.500000 7-1.930321-1 425 3 2 17
Sincerely I've found no such thing on StackOverflow so I ask the question directly.
Any answers would be nice. Thank you ! hh
CodePudding user response:
You'll need two components: 1.) A file reader How to read files in python 2.) A filter. I'd recommend regex for this regex in python
Loop through each line using your file reader. Use your filter to look for and extract data on each line.
This example may work if 425 will be at the start of all your data:
import re
data = []
# Allows you to read from your file
with open('your.txt') as f:
# Will loop through each line
for line in f.readlines():
# Looks for 425 and pulls till end of line:
data_point = re.search('425.*', line).group()
data.append( data_point )
CodePudding user response:
The table columns appear to have fixed widths.
Without having more of the table, it is hard to reliably determine if a row is meaningful or not. I've dropped rows if the value in the first column doesn't appear to be a number like 3.500000 6 by looking for .
at the exact position.
The numbers in the first 6 columns seem to be in scientific notation but without the e
. Code fixing that is included at the bottom.
import pandas as pd
filename = r"C:\Users\Bobson Dugnutt\Desktop\table.txt"
column_widths = (11, 11, 11, 11, 11, 11, 4, 2, 3, 5)
df = pd.read_fwf(filename, widths=column_widths, header=None)
print(df, end="\n\n")
# Returns True if it looks like a number like "3.500000 6"
def is_meaningful_row(value):
#print(value)
if isinstance(value, str):
if value[-9] == ".":
return True
return False
# Remove rows if the first column doesn't look like a number like "3.500000 6"
df = df[df[0].map(is_meaningful_row)]
# Reset the index so the first row is no longer index 5
df = df.reset_index(drop=True)
print(df, end="\n\n")
# Replace NA/NaN values with empty strings
df = df.fillna("")
print(df, end="\n\n")
# Transforms a string like "-2.039133-1" to "-2.039133e-1"
def fix_scientific_notation(value):
if isinstance(value, str):
return value[0:1] value[1:].replace(" ", "e ").replace("-", "e-")
return value
# For the first 6 columns, fixes the strange scientific notation and converts
# values from string to float64 or int64
for column in df.columns[:6]:
df[column] = df[column].map(fix_scientific_notation)
df[column] = pd.to_numeric(df[column])
print(df)
Output:
0 1 2 3 4 5 6 7 8 9
0 References NaN NaN NaN NaN NaN 425 1 451 259
1 /1/J.H.Gibb ons,R.L.Mac klin:Phys.R ev.B137,150 8(1965) NaN 425 1 451 260
2 /2/P.R.Wrea n,C.R.Brune ,R.W.Kavana gh:Phys.Rev .C49,1205(1 994) 425 1 451 261
3 /3/L.Van de r Zwan,K.W. Geiger:Nucl .Phys.A152, 481(1970) NaN 425 1 451 262
4 /4/T.Murata :JAERI-Conf 98-003,p.2 15 NaN NaN 425 1 451 263
5 3.500000 6 3.844649-2 3.508375 6 3.783472-2 4.000000 6 -2.064883-2 425 3 2 7
6 4.014688 6 -2.289045-2 4.403160 6 -8.623264-2 4.500000 6 -1.022847-1 425 3 2 8
7 1.450000 7 -2.039133-1 1.500000 7 -1.930321-1 NaN NaN 425 3 2 17
0 1 2 3 4 5 6 7 8 9
0 3.500000 6 3.844649-2 3.508375 6 3.783472-2 4.000000 6 -2.064883-2 425 3 2 7
1 4.014688 6 -2.289045-2 4.403160 6 -8.623264-2 4.500000 6 -1.022847-1 425 3 2 8
2 1.450000 7 -2.039133-1 1.500000 7 -1.930321-1 NaN NaN 425 3 2 17
0 1 2 3 4 5 6 7 8 9
0 3.500000 6 3.844649-2 3.508375 6 3.783472-2 4.000000 6 -2.064883-2 425 3 2 7
1 4.014688 6 -2.289045-2 4.403160 6 -8.623264-2 4.500000 6 -1.022847-1 425 3 2 8
2 1.450000 7 -2.039133-1 1.500000 7 -1.930321-1 425 3 2 17
0 1 2 3 4 5 6 7 8 9
0 3500000.0 0.038446 3508375.0 0.037835 4000000.0 -0.020649 425 3 2 7
1 4014688.0 -0.022890 4403160.0 -0.086233 4500000.0 -0.102285 425 3 2 8
2 14500000.0 -0.203913 15000000.0 -0.193032 NaN NaN 425 3 2 17