Calculate share of strings in list, based on other list of strings

What I am trying to do: I want to calculate the relative representation of certain surnames (or groups of surnames) that are defined in another list. In other words, I want to know the share of these surnames within a larger group.

Example:

First, I have a list of 'special surnames' that were defined through certain criteria:

N1 <- data.frame(c("Smith", "Mountain", "Friedman", "Keynes"))

Next, I have a bigger list that contains more names, including duplicate entries:

N2 <- data.frame(c("Delange", "Smith", "Mountain", "Keynes", "Woodman", "Smith", "Keynes", "Keynes"))

Now I want to calculate how many times the names defined in the first list occur in the second list. Here, 6 of the 8 entries in the second list meet the criteria of the first list (Smith twice, Mountain once, Keynes three times). From that count I can work out the share of 'special surnames' in the second list.

My real data frames are quite extensive, and sadly I haven't been able to find a solution to this problem, even though it sounds rather easy to solve.

Why / the bigger picture: I am trying to track elites over time through surnames. First, a set of elite surnames k is defined in generation t-1. Second, I calculate the relative representation of these surnames k in generations t, t+1, ..., t+n: (share of surnames k in the elite group of generation t) / (share of surnames k in the general population of generation t)

CodePudding user response：

Given your data (with a name added to the column of each data frame):

N1 <- data.frame("sur"=c("Smith", "Mountain", "Friedman", "Keynes"))
N2 <- data.frame("sur"=c("Delange", "Smith", "Mountain", "Keynes", "Woodman", "Smith", "Keynes", "Keynes" ))

Per-name counts, using table and merge:

> table(merge(N1,N2,by="sur"))
  Keynes Mountain    Smith 
       3        1        2

For the total share, using match:

> mean(complete.cases(match(N2$sur,N1$sur)))
[1] 0.75
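
Note that the merge approach drops special surnames that never occur in N2 (Friedman here). If those zero counts matter, one possible base R sketch is to tabulate against a factor whose levels are the special surnames. The variable names hits, counts and shares are only illustrative and not part of the original answer.

```r
N1 <- data.frame(sur = c("Smith", "Mountain", "Friedman", "Keynes"))
N2 <- data.frame(sur = c("Delange", "Smith", "Mountain", "Keynes",
                         "Woodman", "Smith", "Keynes", "Keynes"))

# Keep only the entries of N2 that also appear in N1
hits <- N2$sur[N2$sur %in% N1$sur]

# Tabulate against all special surnames, so absent names are counted as 0
counts <- table(factor(hits, levels = N1$sur))
print(counts)

# Per-name share of all entries in N2; the shares sum to the total share
shares <- counts / nrow(N2)
print(shares)
```

Summing shares should give the same total share as the match-based line above.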

CodePudding user response：

Here is a solution using tidyverse packages. The first two steps of the pipeline do most of the work; the last two handle the case where some names have zero occurrences. In a more complex case it might be better to replace the last two steps with tidyr::complete.

library(dplyr)
library(tidyr)

N1 <- data.frame(name = c("Smith", "Mountain", "Friedman", "Keynes"))
N2 <- data.frame(name = c("Delange", "Smith", "Mountain", "Keynes", "Woodman", "Smith", "Keynes", "Keynes"))

inner_join(N1, N2, by = "name") |> # combine datasets, keeping only names present in both
  count(name) |>                   # count occurrences of each name
  right_join(N1, by = "name") |>   # add back names from N1 that have no match in N2
  replace_na(list(n = 0))          # replace NA counts with zero


      name n
1   Keynes 3
2 Mountain 1
3    Smith 2
4 Friedman 0
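
The complete() alternative mentioned above could be sketched roughly as follows. This is one possible reading of that suggestion, not code from the original answer: complete() comes from tidyr, and the name counts is only illustrative. The row order of the result may differ from the output shown above.

```r
library(dplyr)
library(tidyr)

N1 <- data.frame(name = c("Smith", "Mountain", "Friedman", "Keynes"))
N2 <- data.frame(name = c("Delange", "Smith", "Mountain", "Keynes",
                          "Woodman", "Smith", "Keynes", "Keynes"))

counts <- inner_join(N1, N2, by = "name") |>       # names present in both
  count(name) |>                                   # occurrences per name
  complete(name = N1$name, fill = list(n = 0L))    # add missing names with n = 0

print(counts)
```

Here complete() takes over the job of the right_join and replace_na steps in a single call.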

CodePudding user response：

Use %in% to find the matches, sum them, and divide by the number of rows of N2 to get the share of special surnames.

sum(N2[,1] %in% N1[,1]) / nrow(N2)
#[1] 0.75
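
The same idea could be extended to the relative representation described in the question. The following is only a sketch under assumptions: share_of is a hypothetical helper, and population_t is made-up data standing in for the general population of generation t, which the question does not provide.

```r
# Hypothetical helper: share of entries in `names` that belong to `special`
share_of <- function(names, special) mean(names %in% special)

# Surnames k and the elite group of generation t, taken from the question
special <- c("Smith", "Mountain", "Friedman", "Keynes")
elite_t <- c("Delange", "Smith", "Mountain", "Keynes",
             "Woodman", "Smith", "Keynes", "Keynes")

# Made-up general population for generation t (assumption, for illustration only)
population_t <- c("Smith", "Jones", "Brown", "Keynes",
                  "Taylor", "Wilson", "Delange", "Woodman")

share_elite <- share_of(elite_t, special)       # 6 of 8 entries
share_pop   <- share_of(population_t, special)  # 2 of 8 entries

# Relative representation: values above 1 mean the special surnames are
# over-represented in the elite group compared with the general population
relative_representation <- share_elite / share_pop
print(relative_representation)
```

mean() of the logical vector gives the same result as sum(...) divided by the number of entries, so the helper can be applied to each generation in turn.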