How to standardize non-numeric characters in a column?

Time:11-30

I need to assign a single, consistent name (value) to the different spellings of each university in a column of affiliations. An example of my table is shown below. Of course, there are many other universities, people, and columns; I only need to change this part of the data frame.

Name              Affiliation
Jose Ramayana     OXFORD UNIVERSITY
Andres Andresius  OFORD UNIVERSITY
Pepito Perez      UNIVERSIDAD NACIONAL
Cacolo Osorio     Universidad Nacional de Bogotá
Maleja Patras     Unievrsidad del Valle
Tigre Tony        Universidad Nacional
Pocho Valencia    UNIVERSIDAD DEL VALLE
Puti Gutierrez    OXFORD UNIVERSITY
Chuchi Lopez      UPTC
Ganso Salazar     Uptc
Santiago Andrade  PONTIFICIA UNIVERSIDAD JAVERIANA
Andrés Tigreros   JAVERIANA CALI

I tried the following code, but each person ended up replicated at least 10 times.

library(tidyverse)  # provides %>%, mutate(), map(), str_detect() and unnest()

DB_CO1 <- DB_CO %>%
  mutate(FinalAssociation = map(affiliation, ~ DB_CO$affiliation[str_detect(.x, DB_CO$affiliation)])) %>%
  unnest(cols = c(FinalAssociation))

Desired result: every variant of an affiliation is replaced by one consistent name, like this:

Name              Affiliation
Jose Ramayana     OXFORD UNIVERSITY
Andres Andresius  OXFORD UNIVERSITY
Pepito Perez      UNIVERSIDAD NACIONAL DE BOGOTÁ
Cacolo Osorio     UNIVERSIDAD NACIONAL DE BOGOTÁ
Maleja Patras     UNIVERSIDAD DEL VALLE
Tigre Tony        UNIVERSIDAD NACIONAL DE BOGOTÁ
Pocho Valencia    UNIVERSIDAD DEL VALLE
Puti Gutierrez    OXFORD UNIVERSITY
Chuchi Lopez      UPTC
Ganso Salazar     UPTC
Santiago Andrade  PONTIFICIA UNIVERSIDAD JAVERIANA CALI
Andrés Tigreros   PONTIFICIA UNIVERSIDAD JAVERIANA CALI

Thanks a lot in advance for your help.

CodePudding user response:

This agrep solution relies on several assumptions:

  1. A fuzzy match between the items is possible (i.e. there are no heavily abbreviated names such as UN).
  2. The longest matching string is the desired name.
  3. No ambiguities occur.

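# For each affiliation, collect all affiliations that fuzzily match it
dat_n <- lapply(dat$Affiliation, function(x)
  dat$Affiliation[agrep(x, dat$Affiliation, ignore.case = TRUE)])

# Keep the longest match per row (the first one on ties) and upper-case it
dat$Affiliation_new <- toupper(vapply(dat_n, function(x)
  x[which.max(nchar(x))], character(1)))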

               Name                      Affiliation                  Affiliation_new
1     Jose Ramayana                OXFORD UNIVERSITY                OXFORD UNIVERSITY
2  Andres Andresius                 OFORD UNIVERSITY                OXFORD UNIVERSITY
3      Pepito Perez             UNIVERSIDAD NACIONAL   UNIVERSIDAD NACIONAL DE BOGOTÁ
4     Cacolo Osorio   Universidad Nacional de Bogotá   UNIVERSIDAD NACIONAL DE BOGOTÁ
5     Maleja Patras            Unievrsidad del Valle            UNIEVRSIDAD DEL VALLE
6        Tigre Tony             Universidad Nacional   UNIVERSIDAD NACIONAL DE BOGOTÁ
7    Pocho Valencia            UNIVERSIDAD DEL VALLE            UNIEVRSIDAD DEL VALLE
8    Puti Gutierrez                OXFORD UNIVERSITY                OXFORD UNIVERSITY
9      Chuchi Lopez                             UPTC                             UPTC
10    Ganso Salazar                             Uptc                             UPTC
11 Santiago Andrade PONTIFICIA UNIVERSIDAD JAVERIANA PONTIFICIA UNIVERSIDAD JAVERIANA
12  Andrés Tigreros                   JAVERIANA CALI                   JAVERIANA CALI
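
In the output above, rows 5 and 7 keep the misspelled UNIEVRSIDAD DEL VALLE (which.max() returns the first of two equally long strings), and rows 11 and 12 are not merged. One possible way around both is to match against a hand-maintained list of canonical names instead of against the column itself. The following is only a minimal sketch under that assumption: the canonical vector, the 20% cutoff, and the helper names d, best, best_dist and too_far are illustrative and not part of the original answer. It uses adist() with partial = TRUE, which computes the kind of approximate substring distance that agrep() is based on.

```r
# Sample affiliations from the question
dat <- data.frame(
  Affiliation = c("OXFORD UNIVERSITY", "OFORD UNIVERSITY", "UNIVERSIDAD NACIONAL",
                  "Universidad Nacional de Bogotá", "Unievrsidad del Valle",
                  "Universidad Nacional", "UNIVERSIDAD DEL VALLE", "OXFORD UNIVERSITY",
                  "UPTC", "Uptc", "PONTIFICIA UNIVERSIDAD JAVERIANA", "JAVERIANA CALI"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: a hand-written list of the names you want to end up with
canonical <- c("OXFORD UNIVERSITY",
               "UNIVERSIDAD NACIONAL DE BOGOTÁ",
               "UNIVERSIDAD DEL VALLE",
               "UPTC",
               "PONTIFICIA UNIVERSIDAD JAVERIANA CALI")

# Distance matrix: rows = observed affiliations, columns = canonical names.
# partial = TRUE lets an affiliation match a substring of a canonical name.
d <- adist(dat$Affiliation, canonical, partial = TRUE, ignore.case = TRUE)

# Closest canonical name per row (the first one on ties) and its distance
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(best), best)]

# ASSUMPTION: treat anything further away than 20% of its length as unmatched
too_far <- best_dist > 0.2 * nchar(dat$Affiliation)
dat$Affiliation_new <- ifelse(too_far, NA_character_, canonical[best])
```

Short or generic inputs (for example just "UNIVERSIDAD") would be at distance 0 from several canonical names, so ties still need manual review. Rows left as NA can be inspected and the canonical list extended accordingly.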

Data

dat <- structure(list(Name = c("Jose Ramayana", "Andres Andresius",
"Pepito Perez", "Cacolo Osorio", "Maleja Patras", "Tigre Tony",
"Pocho Valencia", "Puti Gutierrez", "Chuchi Lopez", "Ganso Salazar",
"Santiago Andrade", "Andrés Tigreros"), Affiliation = c("OXFORD UNIVERSITY",
"OFORD UNIVERSITY", "UNIVERSIDAD NACIONAL", "Universidad Nacional de Bogotá",
"Unievrsidad del Valle", "Universidad Nacional", "UNIVERSIDAD DEL VALLE",
"OXFORD UNIVERSITY", "UPTC", "Uptc", "PONTIFICIA UNIVERSIDAD JAVERIANA",
"JAVERIANA CALI")), class = "data.frame", row.names = c(NA, -12L
))