Slicing dataset for relevant information

I have a large dataset showing the social network links in a corpus. I want to extract just the entities from this corpus. Within the dataset (sample below), the entities can be extracted by taking the entity2 values from the rows that belong to the first entity1 in each paragraph.

So my sample dataset:

structure(list(X = c(6166L, 6168L, 6170L, 6175L, 6177L, 6180L, 
34062L, 34063L, 34064L, 34065L, 34066L), entity1 = c("Epicurus", 
"Epicurus", "Epicurus", "Charles Lamb", "Charles Lamb", "Roman", 
"Egypt", "Egypt", "Egypt", "India", "India"), type1 = c("person", 
"person", "person", "person", "person", "group", "geopolitical area", 
"geopolitical area", "geopolitical area", "geopolitical area", 
"geopolitical area"), entity2 = c("Epic", "Charles Lamb", "Roman", 
"Charles Lamb", "Roman", "Roman", "Egypt", "India", "Arabia", 
"India", "Arabia"), type2 = c("person", "person", "group", "person", 
"group", "group", "geopolitical area", "geopolitical area", "geopolitical area", 
"geopolitical area", "geopolitical area"), text = c("plutarch.txt", 
"plutarch.txt", "plutarch.txt", "plutarch.txt", "plutarch.txt", 
"plutarch.txt", "civilization.txt", "civilization.txt", "civilization.txt", 
"civilization.txt", "civilization.txt"), paragraph = c(49L, 49L, 
49L, 49L, 49L, 49L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA, 
-11L))

For this sample, the result would include just the rows for "Epicurus" and "Egypt". The dataset is 150,000 lines, so this needs to be done programmatically. Paragraphs are numbered in the paragraph column, and the numbers reset for each work, so they are not unique across the dataset. I'm not sure whether the tidyverse has anything for this, or whether I need to do something like extracting the first set of rows with duplicated values in the entity1 column for each paragraph.

Any help is appreciated. Thanks!

CodePudding user response：

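library(dplyr)

# Keep every row whose entity1 matches the first entity1 in its paragraph.
# Grouping by text as well, because paragraph numbers repeat across works.
df |>
  group_by(paragraph, text) |>
  filter(entity1 == first(entity1))

# A tibble: 6 × 7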
# Groups:   paragraph, text [2]
      X entity1  type1             entity2      type2            text  paragraph
  <int> <chr>    <chr>             <chr>        <chr>            <chr>     <int>
1  6166 Epicurus person            Epic         person           plut…        49
2  6168 Epicurus person            Charles Lamb person           plut…        49
3  6170 Epicurus person            Roman        group            plut…        49
4 34062 Egypt    geopolitical area Egypt        geopolitical ar… civi…        15
5 34063 Egypt    geopolitical area India        geopolitical ar… civi…        15
6 34064 Egypt    geopolitical area Arabia       geopolitical ar… civi…        15
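
If the goal is the list of entities rather than the full rows, one possible follow-up is to keep only the entity2 and type2 columns from the filtered result. This is a minimal sketch, not part of the original answer: the small data frame is a cut-down stand-in for the question's sample, and the output column names entity and type are chosen here for illustration.

```r
library(dplyr)

# Cut-down stand-in for the question's sample (only the columns used here)
df <- data.frame(
  entity1   = c("Epicurus", "Epicurus", "Epicurus", "Charles Lamb",
                "Egypt", "Egypt", "India"),
  entity2   = c("Epic", "Charles Lamb", "Roman", "Roman",
                "Egypt", "India", "Arabia"),
  type2     = c("person", "person", "group", "group",
                "geopolitical area", "geopolitical area", "geopolitical area"),
  text      = c(rep("plutarch.txt", 4), rep("civilization.txt", 3)),
  paragraph = c(49L, 49L, 49L, 49L, 15L, 15L, 15L)
)

entities <- df %>%
  group_by(text, paragraph) %>%
  filter(entity1 == first(entity1)) %>%  # rows of the first entity1 per paragraph
  ungroup() %>%
  select(text, paragraph, entity = entity2, type = type2) %>%
  distinct()                             # drop repeated entity rows, if any

entities
```

The ungroup() call is there so that later steps operate on the whole table rather than per paragraph.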

CodePudding user response：

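You can use slice_head(), which returns the first n rows of each group. slice(1) would also work, because it returns the nth row of each group (here, the first). Note that this keeps only the first row per paragraph, not every row for the first entity. Since paragraph numbers reset for each work, group by text as well as paragraph.

library(dplyr)

df %>% 
  group_by(text, paragraph) %>% 
  slice_head(n = 1)

# A tibble: 2 x 7
# Groups:   text, paragraph [2]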
      X entity1  type1             entity2 type2             text             paragraph
  <int> <chr>    <chr>             <chr>   <chr>             <chr>                <int>
1 34062 Egypt    geopolitical area Egypt   geopolitical area civilization.txt        15
2  6166 Epicurus person            Epic    person            plutarch.txt            49
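
The question notes that paragraph numbers reset for each work. The sketch below illustrates why that matters for the grouping; the toy data frame is hypothetical and was made up so that two works share a paragraph number.

```r
library(dplyr)

# Hypothetical data: two works that both contain a paragraph numbered 1
toy <- data.frame(
  entity1   = c("Epicurus", "Epicurus", "Egypt", "Egypt"),
  entity2   = c("Epic", "Roman", "Egypt", "India"),
  text      = c("plutarch.txt", "plutarch.txt",
                "civilization.txt", "civilization.txt"),
  paragraph = c(1L, 1L, 1L, 1L)
)

# Grouping by paragraph alone merges both works into a single group
by_paragraph <- toy %>%
  group_by(paragraph) %>%
  slice_head(n = 1) %>%
  ungroup()

# Grouping by text and paragraph keeps one first row per work
by_text_paragraph <- toy %>%
  group_by(text, paragraph) %>%
  slice_head(n = 1) %>%
  ungroup()

nrow(by_paragraph)       # 1
nrow(by_text_paragraph)  # 2
```

On the sample in the question the two groupings happen to give the same rows, because paragraphs 15 and 49 each occur in only one text.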

CodePudding user response：

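library(data.table)
setDT(df)  # convert df to a data.table by reference
# First row of each paragraph within each text
# (paragraph numbers repeat across works, so group by both)
df[, .SD[1], by = .(text, paragraph)]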

For large data, this option may be more efficient, as it selects rows by index (.I) rather than building .SD for each group:

df[df[, .I[1], by = .(text, paragraph)]$V1, ]

       X  entity1             type1 entity2             type2             text paragraph
1:  6166 Epicurus            person    Epic            person     plutarch.txt        49
2: 34062    Egypt geopolitical area   Egypt geopolitical area civilization.txt        15
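
The data.table calls above return one row per paragraph. If every row for the first entity is wanted, as in the dplyr filter() approach, a possible variant is to subset .SD by a condition instead of by position. This is a sketch under that assumption, using a cut-down table named dt that stands in for the question's data.

```r
library(data.table)

# Cut-down stand-in for the question's sample
dt <- data.table(
  entity1   = c("Epicurus", "Epicurus", "Charles Lamb",
                "Egypt", "Egypt", "India"),
  entity2   = c("Epic", "Roman", "Roman",
                "Egypt", "India", "Arabia"),
  text      = c(rep("plutarch.txt", 3), rep("civilization.txt", 3)),
  paragraph = c(49L, 49L, 49L, 15L, 15L, 15L)
)

# Within each text/paragraph group, keep rows whose entity1 equals
# the first entity1 of that group
first_entity_rows <- dt[, .SD[entity1 == entity1[1]], by = .(text, paragraph)]

first_entity_rows
```

The grouping columns come first in the result, followed by the remaining columns.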