For each row, how to return the index such that there isn't an NA after this index?

Time:11-20

T1  T2  T3  T4  T5
1   1   NA  NA  1
NA  NA  1   1   1
NA  1   NA  NA  NA
NA  NA  NA  NA  1

Suppose my data frame looks like this (see the data above; sorry, I don't know how to replicate it on Stack Overflow). T1 stands for the first period and T5 for the last.


For each row, I want to find the longest span such that no NA appears between its two indices.

So the outcome for the first row should be from T1 to T2 (index 1 to index 2).

The outcome for row 2 should be T3 to T5. The outcome for row 3 should be 0, since no span satisfies the condition.

I would like all results to be put into the data frame, because I actually have a very large number of rows, so I hope my final dataset would look like this:

[image: desired output table]

THANK YOU SO MUCH

CodePudding user response:

Sorry, I didn't see that you had changed your question. Here is a simple solution in base R. First, I build a list holding, for each row, the names of the columns with contiguous non-NA values. From that I extract the first and last component.

df<-data.frame(T1=c(1,NA,NA,NA),
               T2=c(1,NA,1,NA),
               T3=c(NA,1,NA,NA),
               T4=c(NA,1,NA,NA),
               T5=c(1,1,NA,1))


lcon<-apply(df, MARGIN=1, function(x) {!na.contiguous(x)})
df$start_T <- lapply(lcon, function(x) (head(names(x), n=1)))
df$end_T <- lapply(lcon, function(x) (tail(names(x), n=1)))

Then you can assign zero to the rows that have no contiguous run of values, that is, where start_T and end_T are equal. But remember, if a row has multiple sequences of the same length, only the last of those will be extracted. For example, try running the same code on the dataset below, where I modified the first row to have two sequences of length 2.

df_m<-data.frame(T1=c(1,NA,NA,NA),
               T2=c(1,NA,1,NA),
               T3=c(NA,1,NA,NA),
               T4=c(1,1,NA,NA),
               T5=c(1,1,NA,1))
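
To make the tie case concrete, here is a minimal Python sketch (not the base R code above). It finds the longest run of non-missing values in a single row and lets you choose whether the first or the last of several equal-length runs wins. The function name longest_run and its prefer parameter are hypothetical, and missing values are assumed to be represented as None.

```python
def longest_run(row, prefer="first"):
    """Return (start, end) as 1-based positions of the longest run of
    non-missing values in `row`, or (0, 0) if the row has no values.

    `prefer` ("first" or "last") decides which run wins when several
    runs share the maximum length. Both names are illustrative only.
    """
    best = (0, 0)      # best run found so far, as 1-based (start, end)
    best_len = 0
    start = None       # 0-based start of the run currently being scanned
    # A trailing None acts as a sentinel so the final run is closed too.
    for pos, value in enumerate(list(row) + [None]):
        if value is not None:
            if start is None:
                start = pos
            continue
        if start is not None:
            length = pos - start
            wins_tie = prefer == "last" and length == best_len
            if length > best_len or wins_tie:
                best, best_len = (start + 1, pos), length
            start = None
    return best


columns = ["T1", "T2", "T3", "T4", "T5"]
modified_first_row = [1, 1, None, 1, 1]   # two runs of length 2

for prefer in ("first", "last"):
    start, end = longest_run(modified_first_row, prefer)
    print(prefer, columns[start - 1], columns[end - 1])
```

As in the answer above, a row whose longest run is a single value comes back with start equal to end, so the "assign zero" step can be applied afterwards if that is the desired convention.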

CodePudding user response:

Normally with indices you start counting at 0 instead of 1, so this Python script does that. However, if you want the first element to be 1 instead of 0, you can add 1 at lines 48 and 49 of the script, where record_low and record_high are assigned.

# Take user input and make it a list
input = input().split(' ')

# Count the number of columns
columns = 0
for element in input:
  if element.startswith('T'):
    columns += 1
  else:
    break

# Determine the number of rows
rows = int(len(input) / columns) - 1

# Print headers of the table
print ("{:<8} {:<15} {:<10}".format("row", "start_T", "end_T"))

# For every row, loop through that row
for i in range(rows):
  # Determine the first and last index of the row (the header tokens come first, so skip them)
  start = columns + i * columns
  end = start + columns

  # Initialize a streak and record variables
  # The low and high are the 2 indices we are looking for
  streak = -1
  streak_low = 0
  streak_high = 0

  record = 0
  record_low = 0
  record_high = 0

  # Loop through the row
  for j, element in enumerate(input[start:end]):
    #print(i, j, element)
    # If the element is 1, update the streak
    if element == "1":
      streak += 1
      streak_high = j
      # Update lower streak boundary if this is the first in the streak
      if streak == 0:
        streak_low = j

      # If the streak is greater than the current record, update the record
      if streak > record:
        record = streak
        record_low = streak_low # + 1     Normally with indices you start to count at 0
        record_high = streak_high # + 1   if you want to start the count at 1 add the plus 1 here

    # If element is not 1 (thus NA) set the streak back to -1
    else:
      streak = -1
      streak_low = 0
      streak_high = 0
    
  # Print the row number and start / end index
  print ("{:<8} {:<15} {:<10}".format(f"row_{i + 1}", record_low, record_high))
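
The parsing steps at the top of the script can also be written as a small function that takes the text directly instead of reading from standard input, which makes it easier to test. This is only a sketch under the same assumptions as the script: the headers are the leading tokens that start with T, and every value is either 1 or NA. The name parse_table is hypothetical.

```python
def parse_table(text):
    """Split whitespace-separated text into (headers, rows).

    Headers are the leading tokens that start with 'T'. The remaining
    tokens are cut into rows of len(headers) values, with 'NA' -> None.
    """
    tokens = text.split()
    headers = []
    for token in tokens:
        if not token.startswith("T"):
            break
        headers.append(token)
    if not headers:
        raise ValueError("no header tokens starting with 'T' found")

    values = [None if t == "NA" else int(t) for t in tokens[len(headers):]]
    width = len(headers)
    if len(values) % width != 0:
        raise ValueError("value count is not a multiple of the column count")

    # Slice the flat list of values into consecutive rows of equal width.
    rows = [values[i:i + width] for i in range(0, len(values), width)]
    return headers, rows


headers, rows = parse_table(
    "T1 T2 T3 T4 T5 1 1 NA NA 1 NA NA 1 1 1 NA 1 NA NA NA NA NA NA NA 1"
)
print(headers)
for row in rows:
    print(row)
```

Slicing by the number of headers avoids computing row offsets by hand, so the same function should work for any number of rows and columns in this format.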

CodePudding user response:

An approach using rle

outp <- setNames(data.frame(t(apply(d, 1, function(x){
  rl = rle(unlist(x))$lengths
  rbind(which.max(rl) * any(rl > 1), (which.max(rl) + max(rl) - 1) * any(rl > 1))
}))), c("start_T", "end_T"))

cbind(row = paste0("row_", 1:nrow(outp)), outp)
    row start_T end_T
1 row_1       1     2
2 row_2       3     5
3 row_3       0     0
4 row_4       0     0

Data

d <- structure(list(T1 = list(1L, NA, NA, NA), T2 = list(1L, NA, 1L,
    NA), T3 = list(NA, 1L, NA, NA), T4 = list(NA, 1L, NA, NA),
    T5 = list(1L, 1L, NA, 1L)), class = "data.frame", row.names = c(NA,
-4L))
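
For comparison, the same run-length idea can be sketched in Python with itertools.groupby. This is not a translation of the R code: it groups values by whether they are missing (None) rather than by value, and it keeps a running position so that the start of each run is tracked explicitly. The function name longest_run_rle is hypothetical, and runs of a single value are reported as 0, 0 to match the output shown above.

```python
from itertools import groupby


def longest_run_rle(row):
    """Return 1-based (start, end) of the longest run of non-missing values
    that spans at least two columns, or (0, 0) if there is none."""
    best_start, best_len = 0, 0
    position = 1  # 1-based position where the current run begins
    # groupby yields consecutive runs keyed on "is this value missing?"
    for is_missing, group in groupby(row, key=lambda v: v is None):
        length = sum(1 for _ in group)
        if not is_missing and length > best_len:
            best_start, best_len = position, length
        position += length
    if best_len < 2:
        return (0, 0)
    return (best_start, best_start + best_len - 1)


data = [
    [1, 1, None, None, 1],
    [None, None, 1, 1, 1],
    [None, 1, None, None, None],
    [None, None, None, None, 1],
]
for i, row in enumerate(data, start=1):
    print(f"row_{i}", *longest_run_rle(row))
```

Because the comparison is strict, the first of several equal-length runs is kept.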