Home > Software design >  Count values by loop detecting unique string in a dataframe in R
Count values by loop detecting unique string in a dataframe in R

Time:07-14

I have a dataframe called data_total that looks like that:

The dataframe is this Data Base.

Cuenca    Temporada Alcalinidad Carb_ Bicarb_ Fluoruros  Cloruros Nitratos Nitratos_N Sulfatos Nitritos Nitritos_N Fosfatos
  <fct>     <fct>           <dbl> <dbl>   <dbl> <chr>         <dbl> <chr>    <chr>         <dbl> <chr>    <chr>      <chr>   
1 Tumbes    Avenida          283.   0.3    346. 0.495         572.  <0.031   <0.007        901.  <0.003   <0.001     <0.019  
2 Zarumilla Avenida          146.   0.3    179. 0.2439999…    279.  <0.031   <0.007        274.  <0.003   <0.001     <0.019  
3 Zarumilla Avenida          108.   0.3    132. 0.314          62.8 2.22900… 0.504          61.0 <0.003   <0.001     <0.019  
4 Zarumilla Avenida          138.   0.3    168. 0.184         584.  <0.031   <0.007        151.  <0.003   <0.001     <0.019  
5 Tumbes    Avenida          156.   0.3    191. 0.2020000…     68.6 0.151    3.4000000…    124.  <0.003   <0.001     <0.019  
6 Tumbes    Avenida          211    0.3    257. 0.157          64.9 1.04499… 0.2359999…    146.  <0.003   <0.001     <0.019 ````

So I need to summarise how many values by column contain "<" symbol in the dataframe group by Cuenca, Temporada (in the future more factors) in all the others columns in the dataframe from Alcalinidad until the last column.

I know how to do it in one column, by example in Cd_dis column like this:

data_total %>% filter(str_detect(Cd_dis, "<")) %>% 
    group_by(Cuenca,Temporada) %>%
    summarise(non_detect = n(), completed= nrow(conteo)-n())

The output is:

  Cuenca            Temporada non_detect completed
  <fct>             <fct>          <int>     <int>
1 Intercuenca 13951 Avenida            4        95
2 Tumbes            Avenida           23        76
3 Zarumilla         Avenida           13        86

But I need to create a code that runs in all the others columns and join the dataframes obtained for each variable.

Finally, I am creating a package to automatically works with dataframes that contain non-detected values in geology and complete by using Hesel methodology from the United State Geological Survey, so who help with these I am going to mentioned in the package for acknowledgment. Thank so much!!!

CodePudding user response:

A subset of the data.

data_total <- structure(list(Cuenca = c("Tumbes", "Zarumilla", "Zarumilla", 
"Zarumilla", "Tumbes", "Tumbes", "Tumbes", "Tumbes", "Tumbes", 
"Tumbes", "Tumbes", "Tumbes", "Tumbes", "Tumbes", "Tumbes", "Tumbes", 
"Tumbes", "Tumbes", "Tumbes", "Zarumilla"), Temporada = c("Avenida", 
"Avenida", "Avenida", "Avenida", "Avenida", "Avenida", "Avenida", 
"Avenida", "Avenida", "Avenida", "Avenida", "Avenida", "Avenida", 
"Avenida", "Avenida", "Avenida", "Avenida", "Avenida", "Avenida", 
"Avenida"), Alcalinidad = c(283.2, 146.5, 108.3, 137.5, 156.3, 
211, 277.9, 311.4, 44.9, 44.7, 59.1, 41.9, 39, 243.8, 421.9, 
172.7, 93.2, 64.8, 325.8, 80.6), Carb_ = c(0.3, 0.3, 0.3, 0.3, 
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 
0.3, 0.3, 0.3), Bicarb_ = c(345.6, 178.7, 132.1, 167.7, 190.7, 
257.4, 339, 380, 54.7, 54.5, 72.1, 51.1, 47.6, 297.5, 514.7, 
210.7, 113.7, 79.1, 397.5, 98.3), Fluoruros = c(0.495, 0.244, 
0.314, 0.184, 0.202, 0.157, 0.17, 0.09, 0.046, 0.053, 0.047, 
0.068, 0.064, 0.347, 2.72, 0.466, 0.096, 0.09, 0.43, 0.233), 
    Cloruros = c(571.775, 279.242, 62.816, 583.6, 68.641, 64.948, 
    28.799, 19.77, 1.454, 1.51, 15.736, 2.394, 2.51, 209.913, 
    2855.5, 131.031, 32.365, 57.707, 662.67, 15.396), Nitratos = c("<0.031", 
    "<0.031", "2.229", "<0.031", "0.151", "1.045", "5.578", "0.567", 
    "1.768", "2.323", "1.899", "1.76", "2.088", "8.10E-02", "<0.031", 
    "2.656", "<0.031", "1.018", "1.439", "0.133"), Nitratos_N = c("<0.007", 
    "<0.007", "0.504", "<0.007", "3.40E-02", "0.236", "1.26", 
    "0.128", "0.4", "0.525", "0.429", "0.398", "0.472", "1.80E-02", 
    "<0.007", "0.6", "<0.007", "0.23", "0.325", "0.03"), Sulfatos = c(900.82, 
    274.02, 61.05, 151.11, 124.29, 145.78, 126.76, 19.41, 20.72, 
    18.59, 43.25, 23.35, 21.51, 281.15, 5144.7, 142.98, 31.21, 
    54.1, 1211.96, 7.36)), class = "data.frame", row.names = c(NA, 
-20L))

Tidyverse answer.

library("tidyr")
library("dplyr")

data_total |>
  pivot_longer(!1:2, values_transform=list(value=as.character)) |>
  group_by(across(1:3)) |>
  summarize(
    non_detected=sum(str_detect(value, "^<")),
    completed=nrow(data_total) - non_detected,
    .groups="drop")

Results.

# A tibble: 16 × 5
   Cuenca    Temporada name        non_detected completed
   <chr>     <chr>     <chr>              <int>     <int>
 1 Tumbes    Avenida   Alcalinidad            0        20
 2 Tumbes    Avenida   Bicarb_                0        20
 3 Tumbes    Avenida   Carb_                  0        20
 4 Tumbes    Avenida   Cloruros               0        20
 5 Tumbes    Avenida   Fluoruros              0        20
 6 Tumbes    Avenida   Nitratos               3        17
 7 Tumbes    Avenida   Nitratos_N             3        17
 8 Tumbes    Avenida   Sulfatos               0        20
 9 Zarumilla Avenida   Alcalinidad            0        20
10 Zarumilla Avenida   Bicarb_                0        20
11 Zarumilla Avenida   Carb_                  0        20
12 Zarumilla Avenida   Cloruros               0        20
13 Zarumilla Avenida   Fluoruros              0        20
14 Zarumilla Avenida   Nitratos               2        18
15 Zarumilla Avenida   Nitratos_N             2        18
16 Zarumilla Avenida   Sulfatos               0        20
  • Related