Home > OS >  Tidying same ordered data placed in different rows across different columns
Tidying same ordered data placed in different rows across different columns

Time:08-09

I have made a bulk import that results in df that ends up in a non-tidy fashion. The problem lies on the original databases, and on that end there is nothing I can do.

After import and rotate_df I end up with this

x   y   z
1   NA  NA
2   NA  NA
NA  1   NA   
NA  2   NA
NA  NA  1
NA  NA  2

Basically, instead a new file's data is being added stackedly, the data of each additional column (after rotate) only starts at 34 rows than the previous one.

Given I can not change the original databases using, is there any jedi trick that allows changing my current df to this?

x   y   z
1   1   1
2   2   2

As each collumn has the same information, it would allow for my df to be tidy.

Thank you.

CodePudding user response:

If you're getting vector mismatch errors, it suggests that some of your "correct" values are also NAs. These could be NULLs in your database, or values that get turned into NA by R because of a type mismatch: make sure you know what it happening and that you're not losing data in the import.

That said, this means you'll probably want to use the other piece of information you have - that the columns are staggered by a reproducible distance.

df <-
structure(list(x = c(1L, 2L, NA, NA, NA, NA), y = c(NA, NA, 1L, 
2L, NA, NA), z = c(NA, NA, NA, NA, 1L, 2L)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L), class = "data.frame")

stagger <- 2 # this would be 34 in your real data

df_compact <- as.data.frame(lapply(1:ncol(df), function(i) 
    df[((i-1)*stagger 1) : (i*stagger), i]
))
colnames(df_compact) <- colnames(df)

 df_compact
  x y z
1 1 1 1
2 2 2 2

This will only work correctly if nrow(df) = stagger * ncol(df). If not, you will not get an error but all sorts of horrid imputed NAs, so be very careful and find some other way to validate your data afterwards.

CodePudding user response:

Another possible solution, based on purrr::map_dfc:

library(purrr)

map_dfc(df, na.omit)

#> # A tibble: 2 × 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     1     1
#> 2     2     2     2

CodePudding user response:

You could try

as.data.frame(lapply(df, na.omit))

#   x y z
# 1 1 1 1
# 2 2 2 2

Update: It seems that all variables do not have the same length after NAs are removed. Take the following data as example:

x   y   z
1   NA  NA
2   NA  NA
NA  1   NA   
NA  2   NA
NA  NA  1
NA  NA  2
NA  NA  3

What's the expected output for it? Keep the 3 like

#  x  y  z
#  1  1  1
#  2  2  2
# NA AN  3

or discard it

#  x  y  z
#  1  1  1
#  2  2  2

or use other rules?

  • Related