Home > Software design >  How best to parse fields in R?
How best to parse fields in R?

Time:09-09

Below is the sample data. This is how it comes from the current population survey. There are 115 columns in the original. Below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that it be made longer and parsed a bit.

For some context, the first character is the race, a = all, b=black, w=white, and h= hispanic. The second character is the gender, x = all, m = male, and f= female. The third variable, which does not appear in all columns is the age. These values are 2024 for ages 20-24, 3039 or 30-39, and so on. Each one will end in the terms, laborforce unemp or unemprate.

 stfips <- c(32,32,32,32,32,32,32,32)
 areatype <- c(01,01,01,01,01,01,01,01)
 periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
 period <- (01,02,03,04,05,06,07,08)
 xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
 xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
 xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
 walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
 waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
 waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
 balaborforce <- c (5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
 ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
 ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19))
 ba2024lunemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)

 test2 <- data.frame (stfips,areatype,periodyear, period, xalaborforce,xaunemp,xaunemprate,walaborforce, waunemp,waunemprate,balaborforce,ba2024laborforce,ba2024unemp,ba2024unemprate)

 Desired result

 stfips    areatype   periodyear   period race   gender    age   laborforce    unemp    unemprate   

   32         01        2021           01    x       a      all    1210.9      55.7       2.3
   32         01        2021           02    x       a      all    1215.3      55.2       2.5

  .....(the other six rows for race = x and gender = a
   32        01         2021           01    w      a       all    1000.0      50.2       3.4
   32        01         2021           02    w      a       all    999.2       49.5       3.6
  ....(the other six rows for race = w and gender = a
   32        01         2021           01    b      a       2024   1.2         .2     2.1

CodePudding user response:

Edit -- added handling for columns with age prefix. Mostly there, but would be nice to have a concise way to add the - to make 2024 into 20-24....

test2 %>%
    pivot_longer(xalaborforce:ba2024laborforce) %>%
    separate(name, c("race", "gender", "stat"), sep = c(1,2)) %>% 
    mutate(age = coalesce(parse_number(stat) %>% as.character, "all"),
           stat = str_remove_all(stat, "[0-9]")) %>%
    pivot_wider(names_from = stat, values_from = value)
  

# A tibble: 32 × 10
   stfips areatype periodyear period race  gender age   laborforce unemp unemprate
    <dbl>    <dbl>      <dbl>  <dbl> <chr> <chr>  <chr>      <dbl> <dbl>     <dbl>
 1     32        1       2021      1 x     a      all       1211.   55.7       2.3
 2     32        1       2021      1 w     a      all       1000    50.2       3.4
 3     32        1       2021      1 b     a      all          5.5  NA        NA  
 4     32        1       2021      1 b     a      2024         1.2  NA        NA  
 5     32        1       2021      2 x     a      all       1215.   55.2       2.5
 6     32        1       2021      2 w     a      all        999.   49.5       3.6
 7     32        1       2021      2 b     a      all          5.7  NA        NA  
 8     32        1       2021      2 b     a      2024         1.4  NA        NA  
 9     32        1       2021      3 x     a      all       1201.   65.2       2.7
10     32        1       2021      3 w     a      all       1000.   51.6       3.8
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
  • Related