Let's say my data is
df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")
This is a series in which the order is Author, Reference and Abstract. But in some cases, the Abstract data is missing. (In this example, the third Abstract is missing.) So, how can I add NA values in place of Abstract, when Abstract is missing?
In other words, if an element in the vector starts with the word "Reference" but the next element doesn't start with the word "Abstract", I want to add an NA value right after the element starting with "Reference". The resulting vector should be
result <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3",NA,"Author4","Reference4","Abstract4")
How can I do it?
I have tried the append() function in R, but to use it I need the index of the element after which NA should be added, so each NA requires a manual entry.
CodePudding user response:
Here's an approach.
Basically, you build two logical vectors:
- One tests whether each element contains Reference; the other checks that the element does not contain Abstract.
- You offset the second vector by 1, because you want to test whether Abstract follows Reference.
- You take the logical AND of the two.
- Then you use append() to insert NA at the position where Abstract should be but isn't.
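# TRUE where an element is a Reference and the element after it is not an Abstract
ab_missing <- grepl("Reference", df) & c(!grepl("Abstract", df)[-1], FALSE)
# append() takes a single `after` position, so this covers one missing Abstract
df <- append(df, NA, which(ab_missing))
df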
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3" "Reference3" NA "Author4"
[11] "Reference4" "Abstract4"
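Since append() takes a single `after` position, the call above fits the case where exactly one Abstract is missing. If several are missing, or the vector ends on a Reference, one option is to compute each element's position in the output directly. The following is a minimal sketch and not part of the original answer: `insert_missing_abstracts` is a hypothetical helper name, and it assumes the elements start with "Reference" and "Abstract" as in the question.

```r
# Hypothetical helper: insert NA after every Reference not followed by an Abstract
insert_missing_abstracts <- function(v) {
  is_ref      <- grepl("^Reference", v)
  # Is the *next* element an Abstract? The last element has no successor.
  next_is_abs <- c(grepl("^Abstract", v)[-1], FALSE)
  ab_missing  <- is_ref & !next_is_abs

  # Each element moves right by the number of NAs inserted before it
  shift <- cumsum(ab_missing) - ab_missing
  out   <- rep(NA_character_, length(v) + sum(ab_missing))
  out[seq_along(v) + shift] <- v
  out
}

df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2",
        "Author3","Reference3","Author4","Reference4","Abstract4")
insert_missing_abstracts(df)
```

On the question's data this places a single NA after "Reference3"; the slots left unfilled in `out` are exactly the missing Abstract positions.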
CodePudding user response:
One way (and the only way I get these things done) is to think in tibbles or data frames (so this is not the best approach!):
- We create a tibble with one column called x,
- then we group by the numbers (e.g. 1, 1, 1) with the parse_number() function from readr (I love parse_number()),
- with summarise(cur_data()[seq(3),]) we expand each group to the max number of rows (see here: Expand each group to the max n of rows); stop here and pull if NA is desired, otherwise continue,
- finally we use paste with R's recycling ability and pull the vector:
1. In case NA is desired:
library(dplyr)
library(readr)
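# Note: dplyr >= 1.1.0 deprecates cur_data() and multi-row summarise() (see pick() and reframe())
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
                          "Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
  group_by(group = parse_number(x)) %>%   # group by the number in each element
  summarise(cur_data()[seq(3), ]) %>%     # pad every group to 3 rows (NA where missing)
  pull(x)
my_vector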
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
2. In case the missing element should be filled in:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
                          "Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
  group_by(group = parse_number(x)) %>%
  summarise(cur_data()[seq(3), ]) %>%
  # paste0() recycles the three labels along the group numbers
  mutate(group = paste0(c("Author", "Reference", "Abstract"), group)) %>%
  pull(group)
my_vector
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" "Abstract3" "Author4" "Reference4" "Abstract4"
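For comparison, the same "build the full sequence" idea can be sketched in base R. This is only an illustration, not the answerer's code: the names `labels`, `ids`, `expected` and `with_na` are made up here, and it relies on the same assumption as above, namely that every element is a label followed by its record number.

```r
df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2",
        "Author3","Reference3","Author4","Reference4","Abstract4")

labels <- c("Author", "Reference", "Abstract")

# Record numbers, taken by stripping the leading letters ("Author1" -> "1")
ids <- unique(sub("^[A-Za-z]+", "", df))

# Full expected sequence: paste0() recycles the three labels along the ids
expected <- paste0(labels, rep(ids, each = length(labels)))

# Variant 1: NA where the expected element is absent from the data
with_na <- ifelse(expected %in% df, expected, NA_character_)

# Variant 2: the completed sequence itself
expected
```

Here `with_na` corresponds to case 1 and `expected` to case 2.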
CodePudding user response:
A slightly different approach might be to split the vector into one group per "Author" and pad each group to three elements:
c(sapply(split(df, cumsum(grepl("Author", df))), function(g) head(c(g, NA_character_), 3)))
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
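The one-liner can be unpacked step by step to see what each piece contributes. This is a sketch under the question's assumptions (each record starts with an "Author" element and the Abstract, when present, comes last); `record_id`, `records` and `padded` are names introduced only for illustration.

```r
df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2",
        "Author3","Reference3","Author4","Reference4","Abstract4")

# Running count of "Author" elements: every element gets its record's number
record_id <- cumsum(grepl("^Author", df))

# One character vector per record
records <- split(df, record_id)

# Pad (or truncate) each record to exactly 3 elements; missing slots become NA
padded <- lapply(records, function(g) {
  length(g) <- 3
  g
})

# Flatten back into a single character vector
result <- unlist(padded, use.names = FALSE)
result
```

Setting `length(g) <- 3` pads at the end, so this only gives the right positions when the Abstract is the element that is missing.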