T1 T2 T3 T4 T5 1 1 NA NA 1 NA NA 1 1 1 NA 1 NA NA NA NA NA NA NA 1
suppose my dataframe is like this (plz see the picture below, sorry I don't know how to replicate this data in stackoverflow). T1 stands for the first period,T5 stands for the last.
For each row, I want to find a longest spread such that between this two index, there isn't NA appears.
So the outcome for the first row should be from T1 To T2 (index 1 to index2)
the outcome for row 2 should be T3 - T5 the outcome for row 3 should be 0, since there isn't such a satisfied outcome.
and I would like all results to be put into the dataframe, because I have actually tons of rows, so I hope my final dataset would be like:
THANK YOU SO MUCH
CodePudding user response:
sorry I didn't see you changed your question. Here is a simple solution in base R. First, in a list I extract the names of the columns with with contiguous nonNA values. From that I extract the first and last component.
df<-data.frame(T1=c(1,NA,NA,NA),
T2=c(1,NA,1,NA),
T3=c(NA,1,NA,NA),
T4=c(NA,1,NA,NA),
T5=c(1,1,NA,1))
lcon<-apply(df, MARGIN=1, function(x) {!na.contiguous(x)})
df$start_T <- lapply(lcon, function(x) (head(names(x), n=1)))
df$end_T <- lapply(lcon, function(x) (tail(names(x), n=1)))
Then you can assign zero to the sequence where there are not contiguous value, that is were start_T
and end_T
are equal.
But remember, if you are in a situation where you have multiple sequences with the same length, only the last of those will be extracted.
For example, try running the same code but using this dataset as example, where I modified the first row to have two sequences of length 2.
df_m<-data.frame(T1=c(1,NA,NA,NA),
T2=c(1,NA,1,NA),
T3=c(NA,1,NA,NA),
T4=c(1,1,NA,NA),
T5=c(1,1,NA,1))
CodePudding user response:
Normally with indicies you start to count at 0 instead of 1, so this python script will do that. However if you want to the first element to be 1 instead of 0, you can add a 1 at lines 48 and 49
# Take user input and make it a list
input = input().split(' ')
# Count the number of columns
columns = 0
for element in input:
if element.startswith('T'):
columns = 1
else:
break
# Determine the number of rows
rows = int(len(input) / columns) - 1
# Print headers of the table
print ("{:<8} {:<15} {:<10}".format("row", "start_T", "end_T"))
# For every row, loop trough that row
for i in range(rows):
# Determine the first and last index of the row
start = rows 1 i * columns
end = start columns
# Initialize a streak and record variables
# The low and high are the the 2 indicies we are looking for
streak = -1
streak_low = 0
streak_high = 0
record = 0
record_low = 0
record_high = 0
# Loop trough the row
for j, element in enumerate(input[start:end]):
#print(i, j, element)
# If the element is 1, update the streak
if element == "1":
streak = 1
streak_high = j
# Update lower streak boundary if this is the first in the streak
if streak == 0:
streak_low = j
# If the streak is greater then the current record, update the record
if streak > record:
record = streak
record_low = streak_low # 1 Normally with indicies you start to count at 0
record_high = streak_high # 1 if you want to start the count at 1 add the plus 1 here
# If element is not 1 (thus NA) set the streak back to -1
else:
streak = -1
streak_low = 0
streak_high = 0
# Print the row number and start / end index
print ("{:<8} {:<15} {:<10}".format(f"row_{i 1}", record_low, record_high))
CodePudding user response:
An approach using rle
outp <- setNames(data.frame(t(apply(d, 1, function(x){
rl = rle(unlist(x))$lengths
rbind(which.max(rl) * any(rl > 1), (which.max(rl) max(rl) - 1) * any(rl > 1))
}))), c("start_T", "end_T"))
cbind(row = paste0("row_", 1:nrow(outp)), outp)
row start_T end_T
1 row_1 1 2
2 row_2 3 5
3 row_3 0 0
4 row_4 0 0
Data
d <- structure(list(T1 = list(1L, NA, NA, NA), T2 = list(1L, NA, 1L,
NA), T3 = list(NA, 1L, NA, NA), T4 = list(NA, 1L, NA, NA),
T5 = list(1L, 1L, NA, 1L)), class = "data.frame", row.names = c(NA,
-4L))