Removing multi-line texts using regex with sed

Time:08-08

I have the following sample text file containing all my references, which I use for citations in another piece of software (LaTeX). I want to remove the "abstract" field and its contents to reduce the file size and keep only the relevant content.

The sample text is given below:

    doi = {10.3389/fsufs.2021.575056},
    abstract = {Agriculture has come under pressure to meet global food demands, whilst having to meet economic and ecological targets. This has opened newer avenues for investigation in unconventional protein sources. Current agricultural practises manage marginal lands mostly through animal husbandry, which; although effective in land utilisation for food production, largely contributes to global green-house gas (GHG) emissions. Assessing the revalorisation potential of invasive plant species growing on these lands may help encourage their utilisation as an alternate protein source and partially shift the burden from livestock production; the current dominant source of dietary protein, and offer alternate means of income from such lands. Six globally recognised invasive plant species found extensively on marginal lands; Gorse (
              Ulex europaeus
              ), Vetch (
              Vicia sativa
              ), Broom (
              Cytisus scoparius
              ), Fireweed (
              Chamaenerion angustifolium
              ), Bracken (
              Pteridium aquilinum
              ), and Buddleia (
              Buddleja davidii
              ) were collected and characterised to assess their potential as alternate protein sources. Amino acid profiling revealed appreciable levels of essential amino acids totalling 33.05 ± 0.04 41.43 ± 0.05, 33.05 ± 0.11, 32.63 ± 0.04, 48.71 ± 0.02 and 21.48 ± 0.05 mg/g dry plant mass for Gorse, Vetch, Broom Fireweed, Bracken, and Buddleia, respectively. The availability of essential amino acids was limited by protein solubility, and Gorse was found to have the highest soluble protein content. It was also high in bioactive phenolic compounds including cinnamic- phenyl-, pyruvic-, and benzoic acid derivatives. Databases generated using satellite imagery were used to locate the spread of invasive plants. Total biomass was estimated to be roughly 52 Tg with a protein content of 5.2 Tg with a total essential amino acid content of 1.25 Tg ({\textasciitilde}24\%). Globally, Fabaceae was the second most abundant family of invasive plants. Much of the spread was found within marginal lands and shrublands. Analysis of intrinsic agricultural factors revealed economic status as the emergent factor, driven predominantly by land use allocation, with shrublands playing a pivotal role in the model. Diverting resources from invasive plant removal through herbicides and burning to leaf protein extraction may contribute toward sustainable protein, effective land use, and achieving emission targets, while simultaneously maintaining conservation of native plant species.},

    doi = {10.1186/s12864-016-3367-x},
    abstract = {Background: Propionibacterium freudenreichii is an Actinobacterium widely used in the dairy industry as a ripening culture for Swiss-type cheeses, for vitamin B12 production and some strains display probiotic properties. It is reportedly a hardy bacterium, able to survive the cheese-making process and digestive stresses.
Results: During this study, P. freudenreichii CIRM-BIA 138 (alias ITG P9), which has a generation time of five hours in Yeast Extract Lactate medium at 30 °C under microaerophilic conditions, was incubated for 11 days (9 days after entry into stationary phase) in a culture medium, without any adjunct during the incubation. The carbon and free amino acids sources available in the medium, and the organic acids produced by the strain, were monitored throughout growth and survival. Although lactate (the preferred carbon source for P. freudenreichii) was exhausted three days after inoculation, the strain sustained a high population level of 9.3 log10 CFU/mL. Its physiological adaptation was investigated by RNA-seq analysis and revealed a complete disruption of metabolism at the entry into stationary phase as compared to exponential phase.
Conclusions: P. freudenreichii adapts its metabolism during entry into stationary phase by down-regulating oxidative phosphorylation, glycolysis, and the Wood-Werkman cycle by exploiting new nitrogen (glutamate, glycine, alanine) sources, by down-regulating the transcription, translation and secretion of protein. Utilization of polyphosphates was suggested.},
    language = {en},

I want to prune out the abstract and all its contents. So the corresponding output should look like:

doi = {10.3389/fsufs.2021.575056},

doi = {10.1186/s12864-016-3367-x},
language = {en},

I am trying to achieve this using the following 'sed' command: sed 's/\s*abstract.*(\n*.*)*.*[$}]// gm' Test.txt

But it does not seem to work. I have checked using online tools such as https://regex101.com/, and it seems to select the relevant text. But when I try to execute it on my laptop, it doesn't work properly.

I am running this on a Lenovo Thinkpad, MXLinux.

CodePudding user response:

Using GNU sed

$ sed -Ez 's/abstract =[^}]*}([^}]*\.})?,\n  ?//g' input_file
    doi = {10.3389/fsufs.2021.575056},

    doi = {10.1186/s12864-016-3367-x},
    language = {en},

With extended regular expressions enabled (-E) and input records separated by NUL characters instead of newlines (-z), the whole file is read as a single record, so the match can start at abstract = and span several lines:

  • [^}]*} - Match up to the next occurrence of } and include the curly brace.
  • ([^}]*\.})? - This is an optional condition: as above, match up to the next occurrence of a curly brace, but this time ensure there is a full stop directly before the curly brace. It covers abstracts that contain an inner } such as {\textasciitilde}.
  • ,\n - Include the trailing comma and the newline in the match to be removed.
  •   ? - Another optional condition: the ? makes the final space optional, so a space following the newline is removed as well when present.

The g flag at the end will repeat the removal of the match as many times as it finds it.
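
The effect of -z described above can be checked on a small input. This is only a sketch, assuming GNU sed; the two-line input is made up for illustration, and the pattern is a simplified form of the one in the answer (no optional group, no -E):

```shell
# Hypothetical two-line input, used only for illustration.
input='abstract = {first line
second line.},'

# Default mode: sed sees one line at a time, so "\n" in the pattern
# never matches and the text passes through unchanged.
default_mode=$(printf '%s\n' "$input" | sed 's/abstract =[^}]*},\n//')

# With -z (GNU sed), records are NUL-separated, so the whole input is
# one record and the pattern can span the embedded newline.
z_mode=$(printf '%s\n' "$input" | sed -z 's/abstract =[^}]*},\n//')

printf 'default: [%s]\n' "$default_mode"
printf 'with -z: [%s]\n' "$z_mode"
```

The first command leaves the input untouched, which is also why a pattern containing \n cannot match in the line-by-line attempt from the question, while the second removes the whole field.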

CodePudding user response:

With GNU awk you could try the following code, written and tested in GNU awk. It sets the RS variable to a regex that matches the fields to keep (the doi and language lines) and prints each matched record separator, which GNU awk stores in RT.

awk -v RS='(^[[:space:]]*|\n[[:space:]]*)doi = {[^}]*},|[[:space:]] language = {en},' '
RT{ print RT }
' Input_file

Here is the online demo for the above code (NOTE: the online regex uses a non-capturing group, which awk does not support; it appears there only to aid understanding).

CodePudding user response:

This might work for you (GNU sed):

 sed -n '/abstract = {/{:a;/},$/b;n;ba};p' file

Turn off implicit printing -n.

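If a line contains abstract = {, then as long as the current line does not end in },, replace the current line with the next one (n) and check again. Once the current line does end in },, branch to the end of the script (b); because implicit printing is off, this effectively deletes it.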

Otherwise print all other lines.
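
The same loop can be written one command per line with comments, which may be easier to follow. This is a sketch of the one-liner above rather than a different method; the sample record is made up for illustration:

```shell
# Hypothetical sample; the field values are made up for illustration.
sample='    doi = {10.1000/example},
    abstract = {First line of the abstract
continues on a second line.},
    language = {en},'

result=$(printf '%s\n' "$sample" | sed -n '
/abstract = {/{
  # loop start
  :a
  # closing "}," reached: end the cycle without printing
  /},$/b
  # otherwise replace the current line with the next one and retry
  n
  ba
}
# every line outside an abstract field is printed
p
')

printf '%s\n' "$result"
```

Only the doi and language lines are printed. Note that the loop stops at the first line ending in },, so an abstract that happens to contain such a line before its real end would be cut off early.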
