how to create (cases, controls, and NA) in a phenotype column based on IDs in another txt file-CodePudding

I am preparing a phenotype file for a GWAS. I found this conversation helpful but it is not quite what I need to do. I have a large 44k participant txt file (containing all cohort participants). Column1=FID, Column=IID, Coumn3=pseudoID I want to create a 4th column with my phenotype of interest (1=case, 0=control, NA=all other participants). I have 2 separate text files that contain just a column with the pseudoID for my controls and antother txt file for my cases.

(1) How do i create a header for the 4th column?

(2) How do i join the pseudoID from the separate control and case txt file to create a 0 or 1 as required in the 4th column.

(3) How do the remaining empty rows in the 4th column become NA?

I will be using Regenie for the GWAS. I am more familiar with linux less so R. Any help would be appreciated. Thank you.

44k participant file txt

FID IID Pseudo_ID
1 150023532 E78GJHI
1 150023457 E96GH25
1 150075826 E56HFT7 
1 150065943 EH87HN7
1 150034923 ENM8H53

Case txt

E78GJHI
ENM8H53

Control txt

E96GH25
EH87HN7

The expected output

Phenotype File result

FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7 NA
1 150065943 EH87HN7 0
1 150034923 ENM8H5 1

CodePudding user response：

You can directly construct a vector for that 4th column with the present information and add it to the previous data.frame.

I included code to read the dataset, but created the values directly to test the code on itself.

# ppl <- read.csv("Control.txt", sep = " ")
ppl <- data.frame(FID = 1, 
                  IID = c(150023532, 150023457, 150075826, 
                          150065943, 150034923),
                  Pseudo_ID = c("E78GJHI", "E96GH25", "E56HFT7", 
                                "EH87HN7", "ENM8H53"))
ppl
# FID       IID Pseudo_ID
# 1   1 150023532   E78GJHI
# 2   1 150023457   E96GH25
# 3   1 150075826   E56HFT7
# 4   1 150065943   EH87HN7
# 5   1 150034923   ENM8H53

# case <- readLines(file("Case.txt"))
case <- c("E78GJHI", "ENM8H53")
case
# [1] "E78GJHI" "ENM8H53"

# ctrl <- readLines(file("Control.txt"))
ctrl <- c("E96GH25", "EH87HN7")
ctrl
# [1] "E96GH25" "EH87HN7"

I just add the column and it's defined by the presence of the Pseudo_ID values in the case and control vectors. I bet it can be more readible with other packages but this is for easier understanding. ifelse return a vector of the same size as input with the 2 values. Here if pseudo_IP is in ctrl, it return 0 or else it return NA, same with case.

For a data.frame named df, df$name will read the colunm named and df$name <- ... will edit or if absent create the new column.

ppl$ICD_10 <- ifelse(ppl$Pseudo_ID %in% case, 1, 
                     ifelse(ppl$Pseudo_ID %in% ctrl, 0, NA))
ppl
# FID       IID Pseudo_ID ICD_10
# 1   1 150023532   E78GJHI      1
# 2   1 150023457   E96GH25      0
# 3   1 150075826   E56HFT7     NA
# 4   1 150065943   EH87HN7      0
# 5   1 150034923   ENM8H53      1

CodePudding user response：

Is this what you are trying to do? It might not be the most efficient one but you could do the following.

case file (add header and create $2 with status ($2=1 for cases))

awk 'BEGIN{print "Pseudo_ID","ICD_10"}; { print $1,$2=1}' OFS=" " case.txt > case_1.txt

control file (do not add header but create $2 with status ($2=0 for controls))

awk '{ print $1,$2=0}' OFS=" " control.txt > control_1.txt

Merge the two files together

cat case_1.txt control_1.txt > case_control.txt

Match case_control.txt with the phenotype file to get the desired output

awk 'BEGIN {FS=OFS=" "} NR==FNR {a[$1]=$2;next}{print $0, ($3 in a ? a[$3]:"NA")}' case_control.txt phenotype.txt 

FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7  NA
1 150065943 EH87HN7 0
1 150034923 ENM8H53 1