I am stuck on a problem. I have more than 3,000 observations, and each observation is a full text. For example:
text="Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。
It is a divorce dispute, according to 《marriage law》on June 21, 2016。"
Now I want to extract the sentences describing the plaintiff and the defendant, and I also want to know whether the full text contains the phrase "《marriage law》" (T for yes, F for no).
Thus, I want to have the following results:
text | plaintiff | defendant | law |
---|---|---|---|
Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。It is a divorce dispute, according to 《marriage law》on June 21, 2016。 | The plaintiff X, female, born on May, 1980, lives in X County, X Province。 | The defendant X, male, born on May, 1971, lives in X County, X Province。 | T |
I tried several times, but nothing worked. Many thanks for your kind help!
CodePudding user response:
How tight are the patterns you've shown here? Is the plaintiff always in the second sentence? Does the defendant's description always follow the plaintiff? Is punctuation always used?
Here's a method that works with this data. This method does not assume any given order, but it does assume punctuation was used.
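In the regex, 'The plaintiff' (or 'The defendant') is followed by .*, which means "followed by anything", and then by ?, which makes that match lazy, so it stops at the first place where the lookahead succeeds. The lookahead, written as (?= ), marks where the regex should stop looking; the text inside it is not included in the result, so the extracted sentence ends just before the 。. Your sentences end with 。 (the ideographic full stop) rather than an ASCII period (I assume the text was translated), so that is the character inside the lookahead.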
If your real data ends sentences with a period or another regex special character, you'll have to escape it. In this regex, the period followed by the asterisk means "anything else", so to match a literal period or asterisk you have to escape it (written as \\. or \\* inside an R string) so the regex engine knows you mean the character itself.
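As a rough sketch of the escaping point, here is the same lazy pattern applied to text that ends its sentences with an ASCII period. The sample string `ascii_text` is made up for illustration and is not from the question; the rest of the pattern is unchanged.

```r
library(stringr)

# Hypothetical sample: same structure as the question, but with ASCII periods
ascii_text <- "Court of X Province. The plaintiff X, female, lives in X County. The defendant X, male, lives in X County."

# "\\." matches a literal period; an unescaped "." would match any character.
# ".*?" is lazy, so each match stops at the first literal period it reaches.
plaintiff <- str_extract(ascii_text, "The plaintiff.*?\\.")
defendant <- str_extract(ascii_text, "The defendant.*?\\.")

print(plaintiff)
print(defendant)
```

Because the period is matched directly here rather than inside a lookahead, it is kept at the end of each extracted sentence.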
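library(tidyverse)
library(stringi)
# `text` is the example string from the question
tdf <- data.frame(oText = text) %>%
  # lazy match from the keyword up to (but not including) the first 。
  mutate(plaintiff = stri_extract_first_regex(oText, 'The plaintiff.*?(?=(。))'),
         defendant = stri_extract_first_regex(oText, 'The defendant.*?(?=(。))'),
         # TRUE when the text mentions the law
         law = str_detect(oText, 'marriage law'))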
If the patterns are strict, you could probably use tidyr::separate to make this even easier.
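A minimal sketch of the `separate` idea (the function is provided by tidyr), assuming every text has the court, plaintiff, defendant, and dispute sentences in exactly that order and each ends with 。. The column names in `into` are hypothetical, and the sample is the question's text written on one line.

```r
library(tidyr)

# Sample text from the question, on a single line
text <- "Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。It is a divorce dispute, according to 《marriage law》on June 21, 2016。"

parts <- separate(
  data.frame(text = text),
  col = text,
  into = c("court", "plaintiff", "defendant", "dispute"),  # hypothetical names
  sep = "。",       # split on the sentence delimiter (it is removed from the pieces)
  remove = FALSE,   # keep the original text column
  extra = "drop"    # silently drop anything after the fourth piece
)

# fixed = TRUE treats the search string literally rather than as a regex
parts$law <- grepl("《marriage law》", parts$text, fixed = TRUE)

print(parts$plaintiff)
print(parts$defendant)
print(parts$law)
```

Unlike the regex version, this depends entirely on sentence position, so a text with an extra or missing sentence would put the wrong content in each column.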
CodePudding user response:
An approach using str_extract and sub. The substitution removes any follow-up sentences, if they exist, so the detected plaintiff and defendant can only be one sentence long (with 。 as the separator).
library(dplyr)
library(stringr)
tibble(text) %>%
  mutate(plaintiff = sub("(。).*", "\\1", str_extract(text, "The plaintiff.*。")),
         defendant = sub("(。).*", "\\1", str_extract(text, "The defendant.*。")),
         law = grepl("《marriage law》", text)) %>%
  print(Inf)
# A tibble: 1 × 4
text plain…¹ defen…² law
<chr> <chr> <chr> <lgl>
1 "Ganluo County People's Court of X Province。The plaint… The pl… The de… TRUE
# … with abbreviated variable names ¹plaintiff, ²defendant
Full output:
# A tibble: 1 × 4
text
<chr>
1 "Ganluo County People's …
plaintiff
<chr>
1 The plaintiff X, female, born on May, 1980, lives in X County, X Province。
defendant
<chr>
1 The defendant X, male, born on May, 1971, lives in X County, X Province。
law
<lgl>
1 TRUE