I am working with some data where one of the columns looks like
21070808(136)|19995886(87)|21280165(66)
20226255(57)|21440646(54)
...
Just to be clear, this is a single column. Each number which is not in parenthesis represents a publication id (e.g., 21070808) and the number in parenthesis represents the number of citations that this publication received (e.g., publication 21070808 received 136 citations).
For each observation, I would like to count the number of publications as well as the total number of citations. For instance taking the 2 observations above, I would like to get 2 columns (column1=Number of publications and column2=Citations):
Number of publications - Citations
3 - 289
2 - 111
I have tried to look for solutions in R/Stata but could not get anything to work. I think for the number of publications I could just count the number of "|" character and add 1. But for the number of total citations, I am a bit more confused...
Any help would be really appreciated. I am indifferent between R/Stata (and even Python) :)
CodePudding user response:
library(tidyverse)
df %>%
rowid_to_column() %>%
separate_rows(col1, sep = '[|]')%>%
separate(col1, c('num', 'cit'),convert = TRUE, extra = 'drop')%>%
group_by(rowid)%>%
summarise(num = n(), cit = sum(cit))%>%
select(-rowid)
# A tibble: 2 x 2
num cit
<int> <int>
1 3 289
2 2 111
CodePudding user response:
df<-data.frame(x=c("21070808(136)|19995886(87)|21280165(66)","20226255(57)|21440646(54)"))
df$count<-str_count(df$x, "\\|") 1
df$sum<-str_extract_all(df$x, "(?<=\\()[^()]*(?=\\))")
df$sum<-lapply(df$sum, function(y) sum(as.numeric(y)))
df
x sum count 1 21070808(136)|19995886(87)|21280165(66) 289 3 2 20226255(57)|21440646(54) 111 2
Regex sources to match data in parenthesis - Pattern to match only characters within parentheses
CodePudding user response:
Here is a Stata solution.
clear
input str42 problem
"21070808(136)|19995886(87)|21280165(66)"
"20226255(57)|21440646(54)"
end
gen count = strlen(problem) - strlen(subinstr(problem, "|", "", .)) 1
* ssc install moss
moss problem, match("(\([0-9] )") regex
destring _match*, ignore("(") replace
egen citations = rowtotal(_match*)
keep problem count citations
list
------------------------------------------------------------
| problem count citati~s |
|------------------------------------------------------------|
1. | 21070808(136)|19995886(87)|21280165(66) 3 289 |
2. | 20226255(57)|21440646(54) 2 111 |
------------------------------------------------------------