I am trying to parse a text file with contents like this:
The MELTING results are :
Enthalpy : -181,100 cal/mol ( -756,998 J /mol)
Entropy : -467.3 cal/mol-K ( -1,953.31 J /mol-K)
Melting temperature : 75.13 degrees C.
The MELTING results are :
Enthalpy : -170,800 cal/mol ( -713,944 J /mol)
Entropy : -444 cal/mol-K ( -1,855.92 J /mol-K)
Melting temperature : 70.6 degrees C.
I want one row per entry, with enthalpy (either unit, or both), entropy (either unit, or both) and melting temperature as columns. I tried using
awk '$1=="Enthalpy" {print $0}' file.txt > a
and similarly for entropy and melting temperature, planning to combine the columns and parse them afterwards. However, I noticed that
awk '$1=="Enthalpy" {print $0}' file.txt | wc -l
returns 98181 (and similarly for entropy), but the count for melting temperature is only 92418.
If I combine these values independently, I don't know which entries are missing a melting temperature. Is there a way to parse all three together and put NA or a fixed value where the melting temperature is missing? If possible, using awk (bash).
CodePudding user response:
Assign each line to a variable when you match its first word. Then, when you match the MELTING results line that starts a new block, print the variables, replacing empty values with NA, and empty them before processing the next block.
Finally, print the values from the last block in the END block, since there's no MELTING results line after it.
awk '$1 == "Enthalpy" { enth = $0 }
     $1 == "Entropy"  { entr = $0 }
     $1 == "Melting"  { melt = $0 }
     # A new block starts: print the previous one, then reset.
     /MELTING results/ && NR > 1 {
         printf("%s\n%s\n%s\n", (enth ? enth : "NA"), (entr ? entr : "NA"), (melt ? melt : "NA"));
         enth = entr = melt = "";
     }
     # The last block has no following header line, so print it here.
     END {
         printf("%s\n%s\n%s\n", (enth ? enth : "NA"), (entr ? entr : "NA"), (melt ? melt : "NA"));
     }' file.txt > a
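If you want a single row per block rather than three lines, the same idea can be sketched as follows. This is only a sketch under a few assumptions: every block starts with a MELTING results line, only the first number on each line (the cal/mol value) is kept, and thousands separators are dropped. The function name melting_table, the awk helpers na, value and emit, and the column header names are made up for illustration.

```shell
# Hypothetical helper: print one tab-separated row per MELTING block.
# Reads the files given as arguments, or standard input if there are none.
melting_table() {
  awk '
    # Replace an empty value with NA.
    function na(x) { return x == "" ? "NA" : x }

    # Return the first number after the colon, without thousands separators.
    function value(line) {
      sub(/^[^:]*:[ \t]*/, "", line)   # drop the label and the colon
      sub(/[ \t].*$/, "", line)        # drop everything after the first number
      gsub(/,/, "", line)              # drop thousands separators
      return line
    }

    # Print the current block (if one has started) and reset the variables.
    function emit() {
      if (seen) print na(enth), na(entr), na(melt)
      enth = entr = melt = ""
    }

    BEGIN { OFS = "\t"; print "Enthalpy", "Entropy", "Melting_temperature" }
    /MELTING results/ { emit(); seen = 1; next }
    $1 == "Enthalpy"  { enth = value($0) }
    $1 == "Entropy"   { entr = value($0) }
    $1 == "Melting"   { melt = value($0) }
    END { emit() }
  ' "$@"
}
```

For example, melting_table file.txt > table.tsv (table.tsv being an arbitrary output name) would write a header row followed by one row per block, with NA in the third column wherever a block has no melting temperature line.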
CodePudding user response:
1) If the R tag means you want an R solution, and assuming you want to keep the first number on each line, we can illustrate it using the file generated reproducibly in the Note at the end, to which records with missing fields have been added.
First read it into a 2-column data frame with columns V1 and V2, replacing the The MELTING ... V1 fields with an empty string and replacing the first space and everything after it in V2 with an empty string. Also remove all commas from V2. Paste it back together, at which point it is in Debian Control File (DCF) format. Now read that using read.dcf and convert it to a numeric matrix. (The name="" argument to textConnection is needed to circumvent a bug in that function which occurs in long pipelines. This was discussed on r-devel and it seems it is likely already fixed in the development version of R.)
No packages are used.
"melting.txt" |>
  read.table(sep = ":", strip.white = TRUE) |>
  transform(V1 = sub("The MELTING.*", "", V1),
            V2 = sub(" .*", "", gsub(",", "", V2))) |>
  with(paste0(V1, ifelse(nchar(V1), ": ", ""), V2)) |>
  textConnection(name = "") |>
  read.dcf() |>
  type.convert(as.is = TRUE)
giving this numeric matrix:
     Enthalpy Entropy Melting temperature
[1,]  -181100  -467.3               75.13
[2,]  -170800  -444.0               70.60
[3,]  -181100      NA               75.13
[4,]       NA  -444.0               70.60
2) Alternately, a mixed awk/R solution would be the following. Assume that melting.awk is in the current directory and contains:
# convert to dcf
BEGIN { FS = " : "; OFS = ": " }
/MELTING/ { print ""; next }
/:/ { sub(/^ */, "", $1); gsub(/ .*|,/, "", $2); print }
Then assuming gawk is on the PATH run this from R. (This likely works with awk too but I only tried it with gawk.)
"gawk.exe -f melting.awk melting.txt" |>
  pipe() |>
  read.dcf() |>
  type.convert(as.is = TRUE)
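To see which records lack a melting temperature without leaving the shell, the DCF output can also be checked with awk's paragraph mode (RS = ""). This is a minimal sketch, assuming each block has at least one field (two consecutive header lines would collapse into one separator and shift the numbering). The function names melting_to_dcf and records_missing_melt are hypothetical, and the first one just inlines the melting.awk program above.

```shell
# Hypothetical helper: the same conversion as melting.awk, inlined so the
# sketch is self-contained. Reads the given files, or standard input.
melting_to_dcf() {
  awk '
    BEGIN { FS = " : "; OFS = ": " }
    /MELTING/ { print ""; next }
    /:/ { sub(/^ */, "", $1); gsub(/ .*|,/, "", $2); print }
  ' "$@"
}

# Hypothetical helper: print the number of every record that has no
# "Melting temperature" field. With RS = "" each blank-line-separated
# DCF record becomes one awk record, so NR is the record number.
records_missing_melt() {
  melting_to_dcf "$@" | awk 'BEGIN { RS = "" } !/Melting temperature:/ { print NR }'
}
```

Running records_missing_melt melting.txt would then list the positions of the incomplete records, one per line.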
Note
Lines <- " The MELTING results are :
Enthalpy : -181,100 cal/mol ( -756,998 J /mol)
Entropy : -467.3 cal/mol-K ( -1,953.31 J /mol-K)
Melting temperature : 75.13 degrees C.
The MELTING results are :
Enthalpy : -170,800 cal/mol ( -713,944 J /mol)
Entropy : -444 cal/mol-K ( -1,855.92 J /mol-K)
Melting temperature : 70.6 degrees C.
The MELTING results are :
Enthalpy : -181,100 cal/mol ( -756,998 J /mol)
Melting temperature : 75.13 degrees C.
The MELTING results are :
Entropy : -444 cal/mol-K ( -1,855.92 J /mol-K)
Melting temperature : 70.6 degrees C."
cat(Lines, file = "melting.txt")