I have a dataset like the one below:
Id | ArticleName | Pages | Topics | ...
1 | abcd... | 9999 | Animals | ...
2 | aabbcc.. | 8888 | AI, Computer, HiFi | ...
3 | aaabbb | 7777 | Hot Dog, Animals | ...
4 | cccbb | 6666 | Dataset, R | ...
5 | dddss | 64 | Hamburger, AI | ...
Each row of the dataset represents an article, and the Topics column holds a comma-separated list of the topics that article covers.
Each topic belongs to a main Area. For example, let's say:
- Nature: (Animals, Plants)
- Technology: (Computer, HiFi, AI, Intelligent Systems, IOT, Machine Learning)
- Food: (Pizza, Fast Food, Hamburger, Hot Dog, Salad, Fries)
I need to produce a result where each Area gets its own column: an article is marked 1 under an Area (say Nature) if its list of topics contains at least one topic from that Area, and 0 otherwise, and so on for the other Areas. If no Area matches, the article gets a mark in NC (Not Classified). In other words, I need a classification based on the presence of words.
Here's the expected output, taking the dataset shown above as input:
Id | ArticleName | Pages | Topics | Nature | Technology | Food | NC
1 | abcd... | 9999 | Animals | 1 | 0 | 0 | 0
2 | aabbcc.. | 8888 | AI, Computer, HiFi | 0 | 1 | 0 | 0
3 | aaabbb | 7777 | Hot Dog, Animals | 1 | 0 | 1 | 0
4 | cccbb | 6666 | Dataset, R | 0 | 0 | 0 | 1
5 | dddss | 64 | Hamburger, AI | 0 | 1 | 1 | 0
CodePudding user response:
Try this
x <- "
Id | ArticleName | Pages | Topics
1 | abcd... | 9999 | Animals
2 | aabbcc.. | 8888 | AI, Computer, HiFi
3 | aaabbb | 7777 | Hot Dog, Animals
4 | cccbb | 6666 | Dataset, R
5 | dddss | 64 | Hamburger, AI
"
# strip.white = TRUE removes the padding around each "|"-separated field
df <- read.table(textConnection(x), header = TRUE, sep = "|", strip.white = TRUE)
#===================================
Nature <- c("Animals","Plants")
Technology <- c("Computer","HiFi", "AI", "Intelligent Systems", "IOT", "Machine Learning")
Food <- c("Pizza", "Fast Food", "Hamburger", "Hot Dog", "Salad", "Fries")
cls <- matrix(0 ,nrow(df) ,4)
colnames(cls) <- c("Nature" , "Technology" ,"Food" , "NC")
i <- 1
#===================================
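for (topics in df$Topics) {
  # split the comma-separated topics of this article and trim whitespace
  words <- trimws(strsplit(topics, ",")[[1]])
  # mark every Area that has at least one matching topic
  if (any(words %in% Nature)) cls[i, "Nature"] <- 1
  if (any(words %in% Technology)) cls[i, "Technology"] <- 1
  if (any(words %in% Food)) cls[i, "Food"] <- 1
  # NC (Not Classified) only when no Area matched at all
  if (sum(cls[i, 1:3]) == 0) cls[i, "NC"] <- 1
  i <- i + 1
}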
#===================================
cls
#>      Nature Technology Food NC
#> [1,]      1          0    0  0
#> [2,]      0          1    0  0
#> [3,]      1          0    1  0
#> [4,]      0          0    0  1
#> [5,]      0          1    1  0
ans <- cbind(df , cls)
ans
#> Id ArticleName Pages Topics Nature Technology Food NC
#> 1 1 abcd... 9999 Animals 1 0 0 0
#> 2 2 aabbcc.. 8888 AI, Computer, HiFi 0 1 0 0
#> 3 3 aaabbb 7777 Hot Dog, Animals 1 0 1 0
#> 4 4 cccbb 6666 Dataset, R 0 0 0 1
#> 5 5 dddss 64 Hamburger, AI 0 1 1 0
Created on 2022-06-12 by the reprex package (v2.0.1)
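If you'd rather avoid the explicit loop and counter, the same classification can probably be written with `sapply`. This is only a sketch under a few assumptions: `areas` and `articles` are names I made up for illustration (a named list of the Areas from the question, and a cut-down copy of the data with just Id and Topics), and NC is set only when no Area matched.

```r
# Hypothetical lookup: each Area mapped to its topics (taken from the question)
areas <- list(
  Nature     = c("Animals", "Plants"),
  Technology = c("Computer", "HiFi", "AI", "Intelligent Systems", "IOT", "Machine Learning"),
  Food       = c("Pizza", "Fast Food", "Hamburger", "Hot Dog", "Salad", "Fries")
)

# Cut-down copy of the example data (illustrative name)
articles <- data.frame(
  Id = 1:5,
  Topics = c("Animals", "AI, Computer, HiFi", "Hot Dog, Animals",
             "Dataset, R", "Hamburger, AI"),
  stringsAsFactors = FALSE
)

# Split each Topics string on commas and trim surrounding whitespace
topic_list <- lapply(strsplit(articles$Topics, ","), trimws)

# One row per article, one column per Area:
# 1 if at least one topic of the article belongs to that Area
flags <- t(sapply(topic_list, function(tp) {
  sapply(areas, function(a) as.integer(any(tp %in% a)))
}))

# NC is 1 only when none of the Areas matched
flags <- cbind(flags, NC = as.integer(rowSums(flags) == 0))

result <- cbind(articles, flags)
result
```

Adding a new Area then only means adding one entry to `areas`; the rest of the code stays the same.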
CodePudding user response:
Based on the data posted in the comments:
df <- structure(list(Unicode = c("00101", "00101", "00101", "00101", "00101", "00101"),
Univ = c("Torino", "Torino", "Torino", "Torino", "Torino", "Torino"),
Accession.Number = c("WOS:A1995RF98900069", "WOS:000255232100042", "WOS:000258875900011", "WOS:000260047700020", "WOS:000258050500015", "WOS:000180390600004"),
Research.Area = c("BIOCHEMISTRY & MOLECULAR BIOLOGY", "CRITICAL CARE MEDICINE", "INSTRUMENTS & INSTRUMENTATION", "CHEMISTRY, MULTIDISCIPLINARY", "ONCOLOGY", "ONCOLOGY")),
row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
library(qdapTools)
# split each Research.Area on "," or "&" and trim surrounding whitespace
lst <- lapply(df$Research.Area,
              function(area) trimws(strsplit(area, "[,&]")[[1]]))
HotEnc <- mtabulate(lst)
HotEnc
#> BIOCHEMISTRY CHEMISTRY CRITICAL CARE MEDICINE INSTRUMENTATION INSTRUMENTS
#> 1 1 0 0 0 0
#> 2 0 0 1 0 0
#> 3 0 0 0 1 1
#> 4 0 1 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> MOLECULAR BIOLOGY MULTIDISCIPLINARY ONCOLOGY
#> 1 1 0 0
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 1 0
#> 5 0 0 1
#> 6 0 0 1
You can then combine two columns with (HotEnc$cat1 + HotEnc$cat2)/2 to merge them into one category, where cat1 and cat2 are placeholders for two of the columns above.
Created on 2022-06-12 by the reprex package (v2.0.1)
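As a different way to merge indicator columns (not the sum-and-divide approach above), you could mark a broader category as 1 whenever at least one of its member columns is 1. A minimal sketch, assuming a made-up grouping: `hot_enc` is a small stand-in for three of the HotEnc columns, and the `groups` list with its Chemistry and Medicine names is purely hypothetical.

```r
# Stand-in for three of the HotEnc columns shown above (illustrative name)
hot_enc <- data.frame(
  BIOCHEMISTRY = c(1, 0, 0, 0, 0, 0),
  CHEMISTRY    = c(0, 0, 0, 1, 0, 0),
  ONCOLOGY     = c(0, 0, 0, 0, 1, 1)
)

# Hypothetical grouping of indicator columns into broader categories
groups <- list(
  Chemistry = c("BIOCHEMISTRY", "CHEMISTRY"),
  Medicine  = c("ONCOLOGY")
)

# A category is 1 when at least one of its member columns is 1
merged <- sapply(groups, function(cols) {
  as.integer(rowSums(hot_enc[, cols, drop = FALSE]) > 0)
})
merged
```

`drop = FALSE` keeps single-column groups as data frames, so `rowSums` works for groups of any size.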