R - Extract uppercase words whatever the following characters may be

I am struggling right now with regular expressions in R. I have strings that look like this:

ID	Strings
A1	INGÉNIEUR AGRO. ( DEA pédologie IAE). 27 ans whatever whatever.
A2	ARCHITECTE DPLG 29 ans. 6 ans d’exp. prof, blablablabla
A3	PROJETEUR-COMPOSITEUR- ARCHITECTURE 32 ans.

What I want to do is extract everything that is in uppercase at the beginning, which corresponds to each person's job. The result would be:

ID	Strings
A1	INGÉNIEUR AGRO
A2	ARCHITECTE DPLG
A3	PROJETEUR-COMPOSITEUR- ARCHITECTURE

As you can see, the data is dirty and the structure of the strings varies from one case to another. For most strings, I have been able to extract what I wanted with this chunk of code:

data.frame %>%
  mutate(job = str_extract(string, "[[:upper:]]{2,}. \\W"))

But, for a reason I have not been able to find, it fails for some strings. For example, it returns something like this:

ID	Strings
A1	INGÉNIEUR AGRO. ( DEA pédologie IAE). 27 ans whatever whatever.
A2	ARCHITECTE DPLG
A3	PROJETEUR-COMPOSITEUR-

As you can see, A2 is fine, but my code seems to have completely ignored A1, which is left exactly as it was before. For A3, it failed to extract the next uppercase word, "ARCHITECTURE", I think because of the space between "-" and "ARCHITECTURE".

I have tested many regex patterns, but I must admit I'm no expert in the matter and I do not understand all the intricacies of this syntax. So my question is the following: is there a pattern that would let me catch this whole group of uppercase words, and only that, whatever the following characters may be? Keep in mind that the structure of these groups of uppercase words can vary, from simple forms like "WORDWORD word word" to more complex forms like "WORD- WORD (word.)".

I hope this is clear. Thank you very much!

(Here is the example data frame if you want to reproduce it quickly.)

data.frame(ID = c("A1", "A2", "A3"),
           Strings = c("INGÉNIEUR AGRO. (  DEA pédologie   IAE). 27 ans whatever whatever.",
                       "ARCHITECTE DPLG 29 ans. 6 ans d’exp. prof, blablablabla",
                       "PROJETEUR-COMPOSITEUR- ARCHITECTURE 32 ans."))

CodePudding user response：

You can use

library(stringr)
df <- data.frame(ID = c("A1", "A2", "A3"),
            Strings = c("INGÉNIEUR AGRO. (  DEA pédologie   IAE). 27 ans whatever whatever.",
                        "ARCHITECTE DPLG 29 ans. 6 ans d’exp. prof, blablablabla",
                        "PROJETEUR-COMPOSITEUR- ARCHITECTURE 32 ans."))

df$Job <- str_extract(df$Strings, '^[\\p{L}\\s-]*\\p{L}')

The ^[\p{L}\s-]*\p{L} regex matches

^ - start of string
[\p{L}\s-]* - zero or more Unicode letters, whitespace or - chars
\p{L} - a Unicode letter (this is what the extracted value should end with).
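
Note that \p{L} matches lowercase letters too. If the job title should stop at the first lowercase word (the "WORDWORD word word" case from the question), one possible variant is to swap \p{L} for \p{Lu} (uppercase letter). This is only a sketch, not part of the answer above; the vector x and the names any_letter and upper_only are made up for the comparison:

```r
library(stringr)

# Two test strings: one from the question, one where lowercase words follow directly.
x <- c("INGÉNIEUR AGRO. (  DEA pédologie   IAE). 27 ans whatever whatever.",
       "WORDWORD word word")

# Pattern from the answer: any Unicode letter, whitespace or hyphen.
any_letter <- str_extract(x, "^[\\p{L}\\s-]*\\p{L}")

# Assumed variant: uppercase letters only, still ending on an uppercase letter.
upper_only <- str_extract(x, "^[\\p{Lu}\\s-]*\\p{Lu}")

print(any_letter)  # "INGÉNIEUR AGRO" "WORDWORD word word"
print(upper_only)  # "INGÉNIEUR AGRO" "WORDWORD"
```

One caveat with the uppercase-only variant: a capitalised word right after the title (say "ARCHITECTE Dplg") would still contribute its first letter to the match, so it may need further tightening on real data.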


CodePudding user response：

While @Wiktor's solution is better (more concise), maybe you want to see how your approach can be fixed:

library(stringr)
library(dplyr)
df %>%
  mutate(job = str_extract(Strings, "[[[:upper:]]\\s-]+[[:upper:]]"))
  ID                                                            Strings
1 A1 INGÉNIEUR AGRO. (  DEA pédologie   IAE). 27 ans whatever whatever.
2 A2            ARCHITECTE DPLG 29 ans. 6 ans d’exp. prof, blablablabla
3 A3                        PROJETEUR-COMPOSITEUR- ARCHITECTURE 32 ans.
                                  job
1                      INGÉNIEUR AGRO
2                     ARCHITECTE DPLG
3 PROJETEUR-COMPOSITEUR- ARCHITECTURE

[[[:upper:]]\\s-]+ defines a character class for upper-case letters, whitespace, and -, occurring at least once, whereas...
[[:upper:]] ... makes sure the extracted string ends on an upper-case letter (rather than, for example, a whitespace character) (credit to @Wiktor for this idea)
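
To see what the final [[:upper:]] buys you, here is a small comparison of the class on its own against the class followed by one mandatory upper-case letter. It is a sketch for illustration only; the vector x and the names loose and tight are not from the answer:

```r
library(stringr)

x <- c("ARCHITECTE DPLG 29 ans.",
       "PROJETEUR-COMPOSITEUR- ARCHITECTURE 32 ans.")

# Character class alone: the greedy match also takes the space before the age.
loose <- str_extract(x, "[[[:upper:]]\\s-]+")

# Class plus one required upper-case letter: the match backs off until it
# ends on a letter, so the trailing space is dropped.
tight <- str_extract(x, "[[[:upper:]]\\s-]+[[:upper:]]")

print(loose)  # "ARCHITECTE DPLG " "PROJETEUR-COMPOSITEUR- ARCHITECTURE "
print(tight)  # "ARCHITECTE DPLG"  "PROJETEUR-COMPOSITEUR- ARCHITECTURE"
```

Unlike the first answer, this pattern is not anchored with ^, so on a string that starts with lowercase text it would pick up the first upper-case run found later in the string.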

CodePudding user response：

Thank you both for your answers! They work perfectly, and your explanations helped me understand a bit more about how regex works :)