I have the following dataset with number of rows in the millions:
| operating aln | operating fltnum | aln1 | flt1 | aln2| flt2| aln3| flt3|...|aln8|flt8|
Each operating airline and flight may have up to 8 code share flights (aln1-8 and flt 1-8). So some columns (aln2- aln8) may contain data, and some may not. My goal is to get a unique airline and flight number from all the columns combined. I wrote a code that I know is very inefficient (but it does the trick)
#separate each airline and flight number into its own dataframe
t1 <- df[,c("aln1","flt1")]
t2 <- df[,c("aln2","flt2")]
t3 <- df[,c("aln3","flt3")]
t4 <- df[,c("aln4","flt4")]
t5 <- df[,c("aln5","flt5")]
t6 <- df[,c("aln6","flt6")]
t7 <- df[,c("aln7","flt7")]
t8 <- df[,c("aln8","flt8")]
# get rid of all rows with NA in them
t1 <- na.omit(t1)
t2 <- na.omit(t2)
t3 <- na.omit(t3)
t4 <- na.omit(t4)
t5 <- na.omit(t5)
t6 <- na.omit(t6)
t7 <- na.omit(t7)
t8 <- na.omit(t8)
# rename all variables so I may combine them together in one dataframe
t1 <- rename(t1,aln = aln1, fltnum = flt1)
t2 <- rename(t2,aln = aln2, fltnum = flt2)
t3 <- rename(t3,aln = aln3, fltnum = flt3)
t4 <- rename(t4,aln = aln4, fltnum = flt4)
t5 <- rename(t5,aln = aln5, fltnum = flt5)
t6 <- rename(t6,aln = aln6, fltnum = flt6)
t7 <- rename(t7,aln = aln7, fltnum = flt7)
t8 <- rename(t8,aln = aln8, fltnum = flt8)
# combine and get a unique sort
cshares_010 <- rbind(t1,t2,t3,t4,t5,t6,t7,t8)
cshares_010<- distinct(cshares_010, aln, fltnum)
My question is if anyone knows a more 'efficient' way of doing this?
I tried creating a function that will use a loop to do all the above but that can't seem to make it work (when I worked with SAS I created a macro and used the loop number in {} to create and use variables).
For example to create tables t1-t8:
names <- list("t1,"t2,"t3","t4","t5","t6","t7","t8")
pos <- list(1,2,3,4,5,6,7,8)
for (j in 1:8)
{
t[[j]] <- df[, c(3 (pos[[j]]*2),(4 (pos[[j]]*2)] # so that I get a c(5,6) for t1, c(7,8) for t2,etc..
}
The code seems to choose the correct columns in each loop (so the equations work when i test them separately), but R gives an error when running the function stating "more elements supplied than they are to replace"
Appreciate your thoughts and advice.
CodePudding user response:
The following should work.
You can assign
the values to a new dataframe, then work on this new dataframe. Finally you can assign the correct dataframe name at the end. It is sometimes complex to assign the name of your dataframe at the start of the loop because harder to work on it.
You can also use the function get(paste("t", i, sep=""))
to output a dataframe in a loop.
### Initiating empty list
listOfDf <- list()
### Loop
for(i in 1:8){
# Assign you variables to a dummy variable, simpler when you want to work on a df in a loop
assign("newDf", df[, c(paste("aln", i, sep=""), paste("flt", i, sep=""))])
# NA omit
newDf <- na.omit(newDf)
# Rename variables
newDf <- rename(newDf,
aln=paste("aln", i, sep=""),
flt=paste("flt", i, sep=""))
# Assign the reworked dataframe to ti
assign(paste("t", i, sep=""), newDf)
# Store each dataframe in a list
listOfDf[[i]] <- newDf
}
### Rbind all your dataframes
do.call(rbind, listOfDf)
CodePudding user response:
Maybe build a new data.frame
after unlist
ing.
unique(
na.omit(
data.frame(
aln = unlist(df[,grep("aln", colnames(df))], 0, 0),
flt = unlist(df[,grep("flt", colnames(df))], 0, 0)
)
)
)
#> aln flt
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D 4
#> 5 E 5
#> 6 F 6
#> 7 G 7
#> 8 H 8
Data
df <- as.data.frame(
c(
as.list(LETTERS[1:8]),
as.list(1:8)
)[rep(1:8, each = 2) c(0, 8)]
)
df <- setNames(df, paste0(rep(c("aln", "flt"), 8), rep(1:8, each = 2)))
df
#> aln1 flt1 aln2 flt2 aln3 flt3 aln4 flt4 aln5 flt5 aln6 flt6 aln7 flt7 aln8
#> 1 A 1 B 2 C 3 D 4 E 5 F 6 G 7 H
#> flt8
#> 1 8