Erratic output when accessing cells in dataframe

Time:05-16

I have a data frame of character strings that includes NAs. Here is a modified subset of it:

subdf
            Col1                Col2
1           <NA>                <NA>
2 Other Services                <NA>
3 Other Services                <NA>
4 Other Services Services of lawyers
5 Other Services                <NA>

I want to replace the NA's depending on the cell value to their left/right. I tried to do this the following way:

subdf$Col1[subdf$Col2=="Services of lawyers"]

[1] NA               NA              
[3] NA               "Other Services"
[5] NA

As is apparent, I get erratic output when the lookup hits an NA cell. This makes it impossible to reliably replace the appropriate NA value.

na.omit() is obviously not applicable, since I am expecting NA as output in order to replace it.

CodePudding user response:

TL;DR

You could use which around your logical test to remove the unexpected NA results of the subsetting operation:

subdf$Col1[which(subdf$Col2=="Services of lawyers")]
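As a rough sketch of how this could carry over to the replacement itself, the snippet below rebuilds the example data frame from the question with real NA values (an assumption about how the data is stored) and uses which() to select rows. The fill value "Unspecified service" is hypothetical and only there for illustration.

```r
# Rebuild the question's example, assuming the <NA> entries are real missing values
subdf <- data.frame(
  Col1 = c(NA, "Other Services", "Other Services", "Other Services", "Other Services"),
  Col2 = c(NA, NA, NA, "Services of lawyers", NA),
  stringsAsFactors = FALSE
)

# Without which(): every NA in the comparison shows up as an NA in the result
plain <- subdf$Col1[subdf$Col2 == "Services of lawyers"]

# With which(): only rows where the test is TRUE are selected
idx <- which(subdf$Col2 == "Services of lawyers")
hit <- subdf$Col1[idx]

# Hypothetical replacement: fill NA cells in Col2 whose left neighbour is "Other Services"
fill <- which(subdf$Col1 == "Other Services" & is.na(subdf$Col2))
subdf$Col2[fill] <- "Unspecified service"
```

Row 1 is left untouched here because its Col1 value is NA, so the combined test is NA for that row and which() drops it.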

Explanation

I think we can replicate your issue like this. Suppose I have a data frame with no NA values:

df1 <- data.frame(x = c("A", "B", "C"), y = 1:3)

If we want to find the values of column y when x == "A", we do:

df1$y[df1$x == "A"]
#> [1] 1

This gives us the expected result. But look what happens when there are NA values in x:

df2 <- data.frame(x = c("A", "B", NA), y = 1:3)

What result would you expect now?

df2$y[df2$x == "A"]
#> [1]  1 NA

This might seem unexpected. After all, we only wanted the values of y where x was "A", but now we have a length-2 result, which matches neither the number of rows in the data frame nor the number of "A"s in it. Why?

It is because we are subsetting by the logical vector df2$x == "A", which is:

df2$x == "A"
#> [1]  TRUE FALSE    NA

So if we subset by this vector, the first item is selected and the second is omitted, but the third is not omitted: subsetting by NA returns NA. That is why we get two items back.
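The same behaviour can be seen on a plain vector, independent of any data frame. This is a minimal illustration; the names vals and keep are made up for the example.

```r
vals <- c(10, 20, 30)
keep <- c(TRUE, FALSE, NA)

# TRUE keeps the element, FALSE drops it, NA yields an NA placeholder
res <- vals[keep]
print(res)
#> [1] 10 NA
```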

The simple way to suppress this is to wrap your logical test in which, which converts the logical vector to the integer indices of its TRUE elements and silently drops NA values:

df2$y[which(df2$x == "A")]
#> [1] 1
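If you would rather keep a logical index than switch to integer positions, two other base R idioms should give the same result for this example: %in%, which returns FALSE rather than NA for missing values, and an explicit is.na() guard. This is a sketch of alternatives, not something the original answer proposes.

```r
df2 <- data.frame(x = c("A", "B", NA), y = 1:3)

# %in% never returns NA, so the NA row is simply treated as "no match"
via_in <- df2$y[df2$x %in% "A"]

# Explicit guard: FALSE & NA evaluates to FALSE, so the NA row is excluded
via_guard <- df2$y[!is.na(df2$x) & df2$x == "A"]
```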

CodePudding user response:

You could try

library(dplyr)
table <- data.frame("Col1"=c(NA, "B", "C"), "Col2"=c("A'", "B'", "C'"))
# Where Col1 is NA, take the capital letter from Col2; otherwise keep Col1
table %>%
  mutate(
    Col1 = ifelse(is.na(Col1), stringr::str_extract(Col2, "[A-Z]"), Col1)
  )

Edit for new data:

df <- tibble::tribble(~Col1, ~Col2, 
        "<NA>", "<NA>",
        "Other Services", "<NA>",
        "Other Services", "<NA>",
        "<NA>", "Services of lawyers",
        "Other Services", "<NA>"    
        ) 

df %>%
  mutate(
    Col1 = ifelse(Col1 == "<NA>", Col2, Col1)
  )
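Note that the edit above compares against the literal string "<NA>". R also prints real missing values in character columns as <NA>, so if the data frame in the question holds actual NA values, a test with is.na() is probably needed instead. A minimal base R sketch under that assumption (df_na is a hypothetical name):

```r
# Same layout as the tribble above, but with real missing values
df_na <- data.frame(
  Col1 = c(NA, "Other Services", "Other Services", NA, "Other Services"),
  Col2 = c(NA, NA, NA, "Services of lawyers", NA),
  stringsAsFactors = FALSE
)

# Take Col2 wherever Col1 is missing; rows where both are NA stay NA
df_na$Col1 <- ifelse(is.na(df_na$Col1), df_na$Col2, df_na$Col1)
```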