I have a large dataset showing the social network links in a corpus. I want to extract just the entities from this corpus. Within the dataset (sample below), the entities can be extracted by capturing the first set of values in the entity2 column for the first entity in each paragraph.
Here is my sample dataset:
structure(list(X = c(6166L, 6168L, 6170L, 6175L, 6177L, 6180L,
34062L, 34063L, 34064L, 34065L, 34066L), entity1 = c("Epicurus",
"Epicurus", "Epicurus", "Charles Lamb", "Charles Lamb", "Roman",
"Egypt", "Egypt", "Egypt", "India", "India"), type1 = c("person",
"person", "person", "person", "person", "group", "geopolitical area",
"geopolitical area", "geopolitical area", "geopolitical area",
"geopolitical area"), entity2 = c("Epic", "Charles Lamb", "Roman",
"Charles Lamb", "Roman", "Roman", "Egypt", "India", "Arabia",
"India", "Arabia"), type2 = c("person", "person", "group", "person",
"group", "group", "geopolitical area", "geopolitical area", "geopolitical area",
"geopolitical area", "geopolitical area"), text = c("plutarch.txt",
"plutarch.txt", "plutarch.txt", "plutarch.txt", "plutarch.txt",
"plutarch.txt", "civilization.txt", "civilization.txt", "civilization.txt",
"civilization.txt", "civilization.txt"), paragraph = c(49L, 49L,
49L, 49L, 49L, 49L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-11L))
For this sample, the result would include only the rows where entity1 is "Epicurus" or "Egypt". The full dataset has 150,000 rows, so this needs to be done programmatically. Paragraph numbers are stored in the paragraph column and reset for each work, so they are not unique across works. I'm not sure whether the tidyverse has something for this, or whether I need to extract the first run of rows sharing the same entity1 value in each paragraph.
Any help is appreciated. Thanks!
CodePudding user response:
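library(dplyr)

# Within each work and paragraph, keep the rows whose entity1 matches the first entity1
df |>
  group_by(paragraph, text) |>
  filter(entity1 == first(entity1))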
# A tibble: 6 × 7
# Groups:   paragraph, text [2]
      X entity1  type1             entity2      type2            text  paragraph
  <int> <chr>    <chr>             <chr>        <chr>            <chr>     <int>
1  6166 Epicurus person            Epic         person           plut…        49
2  6168 Epicurus person            Charles Lamb person           plut…        49
3  6170 Epicurus person            Roman        group            plut…        49
4 34062 Egypt    geopolitical area Egypt        geopolitical ar… civi…        15
5 34063 Egypt    geopolitical area India        geopolitical ar… civi…        15
6 34064 Egypt    geopolitical area Arabia       geopolitical ar… civi…        15
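If the goal is the list of entities rather than the filtered rows, the entity2 and type2 columns can be pulled out of the filtered result. A minimal sketch, assuming the sample data from the question (only the columns needed here are rebuilt; the names entities, entity and type are made up for illustration):

```r
library(dplyr)

# Subset of the question's sample data (only the columns used below)
df <- data.frame(
  entity1 = c("Epicurus", "Epicurus", "Epicurus", "Charles Lamb", "Charles Lamb", "Roman",
              "Egypt", "Egypt", "Egypt", "India", "India"),
  entity2 = c("Epic", "Charles Lamb", "Roman", "Charles Lamb", "Roman", "Roman",
              "Egypt", "India", "Arabia", "India", "Arabia"),
  type2 = c("person", "person", "group", "person", "group", "group",
            rep("geopolitical area", 5)),
  text = c(rep("plutarch.txt", 6), rep("civilization.txt", 5)),
  paragraph = c(rep(49L, 6), rep(15L, 5))
)

entities <- df |>
  group_by(text, paragraph) |>
  filter(entity1 == first(entity1)) |>   # rows for the first entity of each paragraph
  ungroup() |>
  select(text, paragraph, entity = entity2, type = type2) |>  # hypothetical column names
  distinct()                             # drop repeated entity/type pairs

# entities$entity: "Epic" "Charles Lamb" "Roman" "Egypt" "India" "Arabia"
```

The ungroup() call before select() keeps the grouping columns from being handled implicitly, and distinct() only matters if the same entity appears more than once within a paragraph.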
CodePudding user response:
You need slice_head, which returns the first n rows of each group (here, one row per paragraph). slice(1) would also work, because it returns the row at the given position within each group.
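library(dplyr)

# If paragraph numbers repeat across works, group by text as well: group_by(text, paragraph)
df %>%
  group_by(paragraph) %>%
  slice_head(n = 1)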
# A tibble: 2 x 7
# Groups:   paragraph [2]
      X entity1  type1             entity2 type2             text             paragraph
  <int> <chr>    <chr>             <chr>   <chr>             <chr>                <int>
1 34062 Egypt    geopolitical area Egypt   geopolitical area civilization.txt        15
2  6166 Epicurus person            Epic    person            plutarch.txt            49
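The question notes that paragraph numbers reset for each work, so grouping by paragraph alone can merge paragraphs from different works. A small sketch of the difference, using hypothetical toy data (toy, one.txt and two.txt are invented for illustration and are not part of the original dataset):

```r
library(dplyr)

# Hypothetical toy data: two works that both contain a paragraph 1
toy <- data.frame(
  entity1   = c("A", "A", "B", "C", "C", "D"),
  entity2   = c("A", "B", "B", "C", "D", "D"),
  text      = c(rep("one.txt", 3), rep("two.txt", 3)),
  paragraph = rep(1L, 6)
)

# Grouping by paragraph alone treats both works as a single group
by_paragraph <- toy %>%
  group_by(paragraph) %>%
  slice_head(n = 1) %>%
  ungroup()

# Grouping by text and paragraph keeps the works apart
by_both <- toy %>%
  group_by(text, paragraph) %>%
  slice_head(n = 1) %>%
  ungroup()

nrow(by_paragraph)  # 1 (only "A" from one.txt survives)
nrow(by_both)       # 2 ("A" from one.txt and "C" from two.txt)
```

On the 11-row sample the two groupings give the same result only because paragraphs 49 and 15 happen not to collide.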
CodePudding user response:
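library(data.table)
# Convert df to a data.table by reference
setDT(df)
# First row of each paragraph (use by = .(text, paragraph) if numbers repeat across works)
df[, .SD[1], by = paragraph]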
For large data, this variant, which looks up row indices with .I instead of subsetting .SD, may be faster:
df[df[, .I[1], by = paragraph]$V1, ]
       X  entity1             type1 entity2             type2             text paragraph
1:  6166 Epicurus            person    Epic            person     plutarch.txt        49
2: 34062    Egypt geopolitical area   Egypt geopolitical area civilization.txt        15
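The same .I idiom can be adapted to keep every row for the first entity of each paragraph, rather than a single row, while also grouping by work. This is a sketch on hypothetical toy data (dt, idx and first_entity_rows are illustrative names, not from the original answer):

```r
library(data.table)

# Hypothetical toy data: two works that both contain a paragraph 1
dt <- data.table(
  entity1   = c("A", "A", "B", "C", "C", "D"),
  entity2   = c("A", "B", "B", "C", "D", "D"),
  text      = c(rep("one.txt", 3), rep("two.txt", 3)),
  paragraph = rep(1L, 6)
)

# Row numbers of every row whose entity1 equals the first entity1 in its group
idx <- dt[, .I[entity1 == entity1[1L]], by = .(text, paragraph)]$V1

# Subset the original table by those row numbers
first_entity_rows <- dt[idx]

# first_entity_rows$entity2: "A" "B" "C" "D"
```

Whether this is faster than the dplyr filter on 150,000 rows would need to be measured on the real data.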