Classification based on a list of words

Time:06-13

I have a dataset like this one below:

Id | ArticleName | Pages | Topics             | ...
1  | abcd...     | 9999  | Animals            | ...
2  | aabbcc..    | 8888  | AI, Computer, HiFi | ...
3  | aaabbb      | 7777  | Hot Dog, Animals   | ...
4  | cccbb       | 6666  | Dataset, R         | ...
5  | dddss       | 64    | Hamburger, AI      | ...

Each row of the dataset represents an article; the Topics column holds a comma-separated list of the topics that article covers.

Each topic belongs to a main Area. For example, let's say:

  • Nature: (Animals, Plants)
  • Technology: (Computer, HiFi, AI, Intelligent Systems, IOT, Machine Learning)
  • Food: (Pizza, Fast Food, Hamburger, Hot Dog, Salad, Fries)

I need a result where each article gets a mark (say 1) in an Area's column if its list of topics contains at least one word from that Area's list, for example Nature, and likewise for the other Areas. If no matches are found, the article gets a mark in NC (Not Classified). In other words, I need a classification based on the presence of words.

Here's the expected result, taking the dataset shown above as input.

Id | ArticleName | Pages | Topics             | Nature | Technology | Food | NC
1  | abcd...     | 9999  | Animals            | 1      | 0          | 0    | 0
2  | aabbcc..    | 8888  | AI, Computer, HiFi | 0      | 1          | 0    | 0
3  | aaabbb      | 7777  | Hot Dog, Animals   | 1      | 0          | 1    | 0
4  | cccbb       | 6666  | Dataset, R         | 0      | 0          | 0    | 1
5  | dddss       | 64    | Hamburger, AI      | 0      | 1          | 1    | 0

CodePudding user response:

Try this

x <- "
Id | ArticleName | Pages | Topics             
1  | abcd...     | 9999  | Animals            
2  | aabbcc..    | 8888  | AI, Computer, HiFi 
3  | aaabbb      | 7777  | Hot Dog, Animals   
4  | cccbb       | 6666  | Dataset, R         
5  | dddss       | 64    | Hamburger, AI      
"
df <- read.table(textConnection(x) , header = TRUE , sep = "|")
#===================================

Nature <- c("Animals","Plants")
Technology <- c("Computer","HiFi", "AI", "Intelligent Systems", "IOT", "Machine Learning")
Food <- c("Pizza", "Fast Food", "Hamburger", "Hot Dog", "Salad", "Fries")

cls <- matrix(0 ,nrow(df) ,4)
colnames(cls) <- c("Nature" , "Technology" ,"Food" , "NC")
i <- 1

#===================================
for(t in df$Topics) {
  # split the topic string on commas and trim whitespace around each topic
  x <- trimws(strsplit(t , ",")[[1]])
  for (c in x) {
    if (c %in% Nature) cls[i , 1] <- 1
    else if (c %in% Technology) cls[i , 2] <- 1
    else if (c %in% Food) cls[i , 3] <- 1
  }
  # mark NC only when no topic of this article matched any area
  if (sum(cls[i , 1:3]) == 0) cls[i , 4] <- 1
  i <- i + 1
}
#===================================

cls
#>      Nature Technology Food NC
#> [1,]      1          0    0  0
#> [2,]      0          1    0  0
#> [3,]      1          0    1  0
#> [4,]      0          0    0  1
#> [5,]      0          1    1  0


ans <- cbind(df , cls)

ans
#>   Id   ArticleName Pages               Topics Nature Technology Food NC
#> 1  1  abcd...       9999  Animals                  1          0    0  0
#> 2  2  aabbcc..      8888  AI, Computer, HiFi       0          1    0  0
#> 3  3  aaabbb        7777  Hot Dog, Animals         1          0    1  0
#> 4  4  cccbb         6666  Dataset, R               0          0    0  1
#> 5  5  dddss           64  Hamburger, AI            0          1    1  0

Created on 2022-06-12 by the reprex package (v2.0.1)
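
The same idea can also be written without the explicit row counter. The following is only a sketch in base R, not the method used above: it assumes the areas are kept in a named list (the name areas and the helper variables topic_list and flags are made up for illustration), and it marks NC only when none of the areas matched.

```r
# Hypothetical lookup: area name -> topics belonging to it (lists taken from the question)
areas <- list(
  Nature     = c("Animals", "Plants"),
  Technology = c("Computer", "HiFi", "AI", "Intelligent Systems", "IOT", "Machine Learning"),
  Food       = c("Pizza", "Fast Food", "Hamburger", "Hot Dog", "Salad", "Fries")
)

# The Topics column from the question, as a plain character vector
topics <- c("Animals", "AI, Computer, HiFi", "Hot Dog, Animals", "Dataset, R", "Hamburger, AI")

# Split each Topics string on commas and trim the surrounding whitespace
topic_list <- lapply(strsplit(topics, ","), trimws)

# One column per area: 1 if at least one topic of the article belongs to that area
flags <- sapply(areas, function(area) {
  as.integer(sapply(topic_list, function(tp) any(tp %in% area)))
})

# NC is 1 only when no area matched at all
flags <- cbind(flags, NC = as.integer(rowSums(flags) == 0))
flags
```

Adding a new area then only means adding one entry to the list, and the result can be attached to the data with cbind(df, flags) as before.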

CodePudding user response:

Based on the data posted in the comments:

df <- structure(list(Unicode = c("00101", "00101", "00101", "00101", "00101", "00101"),
                     Univ = c("Torino", "Torino", "Torino", "Torino", "Torino", "Torino"),
                     Accession.Number = c("WOS:A1995RF98900069", "WOS:000255232100042", "WOS:000258875900011", "WOS:000260047700020", "WOS:000258050500015", "WOS:000180390600004"),
                     Research.Area = c("BIOCHEMISTRY & MOLECULAR BIOLOGY", "CRITICAL CARE MEDICINE", "INSTRUMENTS & INSTRUMENTATION", "CHEMISTRY, MULTIDISCIPLINARY", "ONCOLOGY", "ONCOLOGY")),
                     row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")


library(qdapTools)
lst <- lapply(df$Research.Area ,
              function(t) trimws(strsplit(t , "[,&]")[[1]]))

HotEnc <- mtabulate(lst)
HotEnc

#>   BIOCHEMISTRY CHEMISTRY CRITICAL CARE MEDICINE INSTRUMENTATION INSTRUMENTS
#> 1            1         0                      0               0           0
#> 2            0         0                      1               0           0
#> 3            0         0                      0               1           1
#> 4            0         1                      0               0           0
#> 5            0         0                      0               0           0
#> 6            0         0                      0               0           0
#>   MOLECULAR BIOLOGY MULTIDISCIPLINARY ONCOLOGY
#> 1                 1                 0        0
#> 2                 0                 0        0
#> 3                 0                 0        0
#> 4                 0                 1        0
#> 5                 0                 0        1
#> 6                 0                 0        1

Then you can sum columns, e.g. (HotEnc$cat1 + HotEnc$cat2)/2, to merge them into one category (cat1 and cat2 stand for whichever columns you want to merge).
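
As a rough sketch of merging columns into one category: the block below rebuilds three of the columns from the output above by hand so that it runs without qdapTools, and the choice of which columns to merge is purely hypothetical. Note that it uses a rowSums(...) > 0 check instead of the average, so a row is flagged with 1 even when only one of the merged columns is set.

```r
# Small stand-in for the one-hot table (values copied from the output above)
HotEnc <- data.frame(
  INSTRUMENTATION = c(0, 0, 1, 0, 0, 0),
  INSTRUMENTS     = c(0, 0, 1, 0, 0, 0),
  ONCOLOGY        = c(0, 0, 0, 0, 1, 1)
)

# Hypothetical grouping: treat both instrument columns as a single category
merged <- as.integer(rowSums(HotEnc[, c("INSTRUMENTATION", "INSTRUMENTS")]) > 0)
merged
```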

Created on 2022-06-12 by the reprex package (v2.0.1)
