Home > Software design >  I am new to coding in R and not able to create a loop for the problem
I am new to coding in R and not able to create a loop for the problem

Time:03-22

I have the following information:

Policy No      Amount     Info (1)    Info (2)     Info (3)     Info (4)   Info (5)
1001            100       500001      500002       500003       999999       999999
1002            200       600001      600002       999999       999999       999999
1003            300       700001      700002       700003       700004       700005

Above is how the information is provided.

This is how I want it to be represented:

Policy No      Amount       Info
1001             100        500001
1001             100        500002
1001             100        500003
1002             200        600001
1002             200        600002
1003             300        700001
1003             300        700002
1003             300        700003
1003             300        700004
1003             300        700005

Basically, I want to check each info column in row and if there is 999999 then go to the next row and produce the values up to the info column has 999999.

I have over 100,000 policy no and amount info like this.

Your help is highly appreciated.

CodePudding user response:

Pivot longer and then remove all 999999 values.

library(dplyr)

df %>%
    pivot_longer(cols = starts_with("Info"), names_to = "Info") %>%
    filter(!Info == 999999)

CodePudding user response:

Possible alternative to using a loop with tidyverse:

library(tidyverse)

df %>%
  mutate(across(everything(), ~na_if(.x, 999999))) %>%
  pivot_longer(-c(`Policy No`, `Amount`), names_to = NULL, values_drop_na = TRUE)

Output

   `Policy No` Amount  value
         <int>  <int>  <int>
 1        1001    100 500001
 2        1001    100 500002
 3        1001    100 500003
 4        1002    200 600001
 5        1002    200 600002
 6        1003    300 700001
 7        1003    300 700002
 8        1003    300 700003
 9        1003    300 700004
10        1003    300 700005

Or if you want to stick with the for loop, then you could do something like this:

output <- NULL

for (i in names(df[,-c(1:2)])) {
  x <- df[which(df[,i] != 999999), c("Policy No", "Amount",i)]
  names(x)[3] <- "Info"
  output <- rbind(output, x)
}

Data

df <- structure(list(`Policy No` = 1001:1003, Amount = c(100L, 200L, 
300L), `Info (1)` = c(500001L, 600001L, 700001L), `Info (2)` = c(500002L, 
600002L, 700002L), `Info (3)` = c(500003L, 999999L, 700003L), 
    `Info (4)` = c(999999L, 999999L, 700004L), `Info (5)` = c(999999L, 
    999999L, 700005L)), class = "data.frame", row.names = c(NA, 
-3L))

CodePudding user response:

There are certain circumstances where we need to rotate the data from a wide format to a long format. I had a similar issue like this in some of my code but in reverse. Here is the question I posted. This also provides other options that can be used to solve this problem.

R- How to put multiple cases (rows) in one row

I would use reshape. Here is a link to the r documentation.
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/reshape

Here is the data.

policyno <- c (1001, 1002, 1003)
amount <- c (100, 200, 300)
info1 <- c (500001, 600001, 700001)
info2 <- c (500002, 600002, 700002)
info3 <- c (500003, 999999, 700003)
info4 <- c (999999, 999999, 700004)
info5 <- c (999999, 999999, 700005)
data <- data.frame (policyno, amount, info1, info2, info3, info4, info5)
data

There are a few things we need to input for reshape to work. We need the name of the data frame (data), the id variable (policyno). We also need the name of the new variable we want to create (info) and a list of the variables which make up this new variable (info1, info2, info3, info4, info5). For generating this list of variables, there are a few ways to do this. I have seen posts where we list the row number. If we have a very small dataset and the variables (and their row numbers stay the same), then this method is fine. If you are working with a very large dataset where the row numbers might change depending upon how you are subsetting the data, then we can have R identify the row numbers for us. Which and calnames can be used for this purpose.

data2 <- reshape (data, 
                  idvar = "policyno", 
                  v.names = c ("info"),
                  varying = list (c (which(colnames(data) == "info1"),
                                    which (colnames(data) == "info2"),
                                    which(colnames(data) == "info3"),
                                    which(colnames(data) == "info4"),
                                    which(colnames(data) == "info5")
                                    )
                                  ),
                  direction = "long")
data2

This rotates the data but doesn't remove the 999999 cases. To remove the 999999, we can use subset. As I minor note, this reshape created a variable labeled time which we can remove using select.

data3 <- subset (data2, 
                 select = -c (time),
                 info != 999999)
data3

Again, this is just one way that can be used to solve the problem. Stackoverflow has a lot of posts about reshape and transpose that are very useful.

  • Related