I have a text file containing details of about 1000 articles, each containing the same items (PMC, PMID..... AID, SO). I need to have every article details as a single row such that the item names are read as column names, and the respective content as row values:
PMC PMID ........ SO
PMC8882595 35237547 ........ Front ....69001
PMC8967719 35296856 ........ Child Yo... 040
The author names and affiliations columns (FAU, AU, AD) in between are repeated, and I just need to extract the country name from that content, if possible. Or just merge them otherwise.
I am really sorry about the poor phrasing of the question. Here is a glimpse of first two articles to describe it better.
PMC - PMC8882595
PMID- 35237547
IS - 2296-2565 (Electronic)
VI - 10
DP - 2022
TI - Poverty Vulnerability and Health Risk Action Path of Families of Rural Elderly
With Chronic Diseases: Empirical Analysis of 1,852 Families in Central and
Western China.
LID - 776901
AB - Health poverty has become the most important cause of poverty and return to
poverty. Understanding the health risk factors and action paths of poverty in
families of rural elderly with chronic diseases is important to alleviate return
to poverty because of illness. This study selected families with at least one
elderly member (over 60 years old) with chronic diseases (sample size was 1,852
people and localities, the causes of poverty and returning to poverty, and the
types of poverty vulnerabilities. The use efficiency of medical insurance should
be further improved, and the responsibility of medical insurance targeted poverty
alleviation must be clarified.
FAU - Ma, Ying
AU - Ma Y
AD - Department of Health Management, School of Medicine and Health Management, Tongji
Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Xiang, Qin
AU - Xiang Q
AD - Department of Health Management, School of Medicine and Health Management, Tongji
Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Yan, Chaoyang
AU - Yan C
AD - Department of Health Management, School of Medicine and Health Management, Tongji
Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Liao, Hui
AU - Liao H
AD - Department of Health Management, School of Medicine and Health Management, Tongji
Medical College, Huazhong University of Science and Technology, Wuhan, China
FAU - Wang, Jing
AU - Wang J
AD - Department of Health Management, School of Medicine and Health Management, Tongji
Medical College, Huazhong University of Science and Technology, Wuhan, China
LA - eng
PT - Journal Article
DEP - 20220214
PHST- 2021/09/14 [received]
PHST- 2022/01/20 [accepted]
TA - Front Public Health
JT - Frontiers in Public Health
AID - 10.3389/fpubh.2022.776901 [doi]
SO - Front Public Health. 2022 Feb 14;10:. doi:10.3389/fpubh.2022.776901.
PMC - PMC6107082
PMID- 30147212
IS - 0190-7409 (Print)
VI - 79
DP - 2017 Aug
TI - Family poverty and neighborhood poverty: Links with children’s school readiness
before and after the Great Recession.
PG - 368-84
AB - This paper examines how neighborhood and family poverty predict children’s
academic skills and classroom behavior at school entry, and whether associations
additional support.
FAU - Wolf, Sharon
AU - Wolf S
AD - Graduate School of Education, University of Pennsylvania, 3700 Walnut Street,
Room 340, Philadelphia, PA 19104, United States
FAU - Magnuson, Katherine A.
AU - Magnuson KA
AD - Institute for Research on Poverty, University of Wisconsin – Madison, 1180
Observatory Drive, Madison, WI 53706, United States
FAU - Kimbro, Rachel T.
AU - Kimbro RT
AD - Rice University, 6100 Main St., MS-28, Houston, TX 77005, United States
LA - eng
PT - Journal Article
DEP - 20170623
TA - Child Youth Serv Rev
JT - Children and youth services review
AID - 10.1016/j.childyouth.2017.06.040 [doi]
MID - NIHMS984695
SO - Child Youth Serv Rev. 2017 Aug;79:368-84. Epub 2017 Jun 23
doi:10.1016/j.childyouth.2017.06.040.
CodePudding user response:
Here's an admittedly brute-force start for you.
Up front, though, you will need to put some thought onto how you want to aggregate your fields. In general, all of the fields (e.g., "PMC"
, "PHST"
, "FAU"
) must recycling safely, with a length of exactly 1 or the same non-1 length from each group. In this data, in the first batch of data, you have PHST
with length 2 and FAU
with length 5. I know this is known based on the variable data, and that's fine, but that means you need to come up with a clean way to aggregate them. Some options might include list-columns or text-reduction using tokenization, pattern-matching, or whatever. You said you want to extract just the country from the multiple AU
fields, that's fine, but ... that's not always easy given the free-text flow of those fields.
I'm addressing everything except how to find country names and reduce those fields. I strongly encourage looking into list-columns (as opposed to my naive string-concatenation) and some other functions to do the aggregation. (Perhaps another question?)
For now, my naive aggregator:
agg_fun <- function(x, fld) {
# do something different based on fld[1] such as extract a country
paste(x, collapse = " ")
}
First, to read in that data, I'll use some bespoke code to turn it into a frame with which we can work a little more easily:
txt <- readLines("~/StackOverflow/12121693/72706285.txt")
tmp1 <- tmp2 <- character(0)
leadsp <- grepl("^\\s", txt)
tmp1[!leadsp] <- sub("^(\\S )\\s*-.*", "\\1", txt[!leadsp])
tmp2[!leadsp] <- sub("^\\S \\s*-\\s*(.*)", "\\1", txt[!leadsp])
tmp2[leadsp] <- sub("^ *", "", txt[leadsp])
tmpdat <- tibble(field = tmp1, value = tmp2)
tmpdat
# # A tibble: 82 x 2
# field value
# <chr> <chr>
# 1 PMC PMC8882595
# 2 PMID 35237547
# 3 IS 2296-2565 (Electronic)
# 4 VI 10
# 5 DP 2022
# 6 TI Poverty Vulnerability and Health Risk Action Path of Families of Rural Elderly
# 7 NA With Chronic Diseases: Empirical Analysis of 1,852 Families in Central and
# 8 NA Western China.
# 9 LID 776901
# 10 AB Health poverty has become the most important cause of poverty and return to
# # ... with 72 more rows
With that, here's a dplyr/tidyr flow using that data:
library(dplyr)
library(tidyr) # pivot_wider
tibble(field = tmp1, value = tmp2) %>%
filter(nzchar(trimws(field)) | nzchar(trimws(value))) %>%
group_by(grp = cumsum(!is.na(field))) %>%
summarize(field = field[1], value = paste(value, collapse = " ")) %>%
ungroup() %>%
group_by(grp = cumsum(field == "PMC"), field) %>%
summarize(value = agg_fun(value, field[1])) %>%
pivot_wider(grp, names_from = "field", values_from = "value") %>%
ungroup()
# # A tibble: 2 x 22
# grp AB AD AID AU DEP DP FAU IS JT LA LID PHST PMC PMID PT SO TA TI VI MID PG
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 Health poverty has become the most imp~ Department of Health Management, Schoo~ 10.3389/f~ Ma Y Xia~ 20220~ 2022 Ma, Ying Xi~ 2296-25~ Frontier~ eng 776901 2021/09/~ PMC8~ 3523~ Journ~ "Front Public He~ Front~ Poverty Vulnerability and~ 10 NA NA
# 2 2 This paper examines how neighborhood a~ Graduate School of Education, Universi~ 10.1016/j~ Wolf S M~ 20170~ 2017 ~ Wolf, Sharo~ 0190-74~ Children~ eng NA NA PMC6~ 3014~ Journ~ "Child Youth Ser~ Child~ Family poverty and neighb~ 79 NIHM~ 368-~
You can likely now discard grp
(it was artifical), and if you look deeper into many of these strings, you'll realize some of them are quite long.
CodePudding user response:
in Base R you could do:
df <- read.dcf(textConnection(sub(" *- *",':',readLines('my_file.txt'))),all=T)
PMC PMID IS VI DP ....
1 PMC8882595 35237547 2296-2565 (Electronic) 10 2022 ....
2 PMC6107082 30147212 0190-7409 (Print) 79 2017 Aug ....
Note that the duplicated values are put in a list. eg:
df$FAU
[[1]]
[1] "Ma, Ying" "Xiang, Qin" "Yan, Chaoyang" "Liao, Hui" "Wang, Jing"
[[2]]
[1] "Wolf, Sharon" "Magnuson, Katherine A." "Kimbro, Rachel T."
Which indicates the values from the first document and the values from the second document