Home > Software engineering >  Split a column and count the number of elements sum elements
Split a column and count the number of elements sum elements

Time:09-15

I am working with some data where one of the columns looks like

21070808(136)|19995886(87)|21280165(66)

20226255(57)|21440646(54)

...

Just to be clear, this is a single column. Each number which is not in parenthesis represents a publication id (e.g., 21070808) and the number in parenthesis represents the number of citations that this publication received (e.g., publication 21070808 received 136 citations).

For each observation, I would like to count the number of publications as well as the total number of citations. For instance taking the 2 observations above, I would like to get 2 columns (column1=Number of publications and column2=Citations):

 Number of publications - Citations

             3          -     289
             2          -     111 

I have tried to look for solutions in R/Stata but could not get anything to work. I think for the number of publications I could just count the number of "|" character and add 1. But for the number of total citations, I am a bit more confused...

Any help would be really appreciated. I am indifferent between R/Stata (and even Python) :)

CodePudding user response:

library(tidyverse)

df %>%
  rowid_to_column() %>%
  separate_rows(col1, sep = '[|]')%>%
  separate(col1, c('num', 'cit'),convert = TRUE, extra = 'drop')%>%
  group_by(rowid)%>%
  summarise(num = n(), cit = sum(cit))%>%
  select(-rowid)

# A tibble: 2 x 2
    num   cit
  <int> <int>
1     3   289
2     2   111

CodePudding user response:

df<-data.frame(x=c("21070808(136)|19995886(87)|21280165(66)","20226255(57)|21440646(54)"))

df$count<-str_count(df$x, "\\|") 1

df$sum<-str_extract_all(df$x, "(?<=\\()[^()]*(?=\\))")

df$sum<-lapply(df$sum, function(y) sum(as.numeric(y)))

df               
                                        x sum count
1 21070808(136)|19995886(87)|21280165(66) 289     3
2               20226255(57)|21440646(54) 111     2

Regex sources to match data in parenthesis - Pattern to match only characters within parentheses

CodePudding user response:

Here is a Stata solution.

clear 
input str42 problem 
"21070808(136)|19995886(87)|21280165(66)"
"20226255(57)|21440646(54)"
end 

gen count = strlen(problem) - strlen(subinstr(problem, "|", "", .))   1 

* ssc install moss 
moss problem, match("(\([0-9] )") regex 
destring _match*, ignore("(") replace 
egen citations = rowtotal(_match*)

keep problem count citations 
list 

      ------------------------------------------------------------ 
     |                                 problem   count   citati~s |
     |------------------------------------------------------------|
  1. | 21070808(136)|19995886(87)|21280165(66)       3        289 |
  2. |               20226255(57)|21440646(54)       2        111 |
      ------------------------------------------------------------ 
  • Related