Is there any way to organize my DF in the format i need using python or R?-CodePudding

I'm struggling to find a way to organize my data frame in the format I need. The database I'm working with provide me a data frame as it shows in the image:

Unorganized data frame

the thing is there's some date on the column ID and in the NF column there's characters. I'm trying a way to organize the dataframe as it follows: Organized Data frame

Where the column ID have only the ID's and in the NF column I have the ID's related to the respective NF.

I was thinking in a way to extract the ID's between the date format who are in the column and complete the NF column with the NF in the column who are in the same row of the date.

I'm currently doing it manually in spreadsheets but if it increase too much will be impossible and my coding skills are bellow the challenge at the moment.

CodePudding user response：

You could do something like this.

The mutate makes anything in the nf column NA unless it starts with 3 digits and a hyphen. You could adjust that regex pattern if needed for your larger dataset. fill then copies down the nf values you want. And the filter removes the rows containing dates.

(You could use %>% instead of |> if not on a recent version of R.)

library(tidyverse)

df <- tribble(
  ~id, ~nf,
  "27/07/2021", "001-00011598",
  "5683", "BASE P/ NOTEBOOK NBC-50BK C3TECH",
  "10180", "SUPORTE MONITOR 17 A 32 T1224N ELG",
  "10208", "SUPORTE F160N DUPLO P/ MONITOR 17 A 27 ELG",
  "12551", "MINI ADAPTADOR XC-BTT-04 XCELL",
  "13121", "GABINETE PINK TGC-PO3P T-DAGGER",
  "13306", "CASE HD   SSD CGHD-G34 EXBOM",
  "27/07/2021", "001-00011599",
  "12551", "MINI ADAPTADOR XC-BTT-04 XCELL",
  "12674", "SUPORTE FIXO GENIUS ELG",
  "13306", "CASE HD   SSD CGHD-G34 EXBOM"
)

df2 <- df |> 
  mutate(nf = if_else(!str_starts(nf, "\\d{3}-"), NA_character_, nf)) |> 
  fill(nf) |> 
  filter(!str_detect(id, "\\d{2}/\\d{2}/\\d{4}"))

df2

#> # A tibble: 9 × 2
#>   id    nf          
#>   <chr> <chr>       
#> 1 5683  001-00011598
#> 2 10180 001-00011598
#> 3 10208 001-00011598
#> 4 12551 001-00011598
#> 5 13121 001-00011598
#> 6 13306 001-00011598
#> 7 12551 001-00011599
#> 8 12674 001-00011599
#> 9 13306 001-00011599

^{Created on 2022-07-01 by the reprex package (v2.0.1)}

CodePudding user response：

Here is a tidyverse approach:

Within an ifelse statement identify the pattern xxx-xxxxxxx all numbers.
fill the NA down with the identified pattern
filter out those with \ in the string from ID

library(dplyr)
library(stringr)
library(tidyr)

df %>% 
  mutate(NF = ifelse(str_detect(NF, '\\d \\-\\d '), NF, NA_character_)) %>% 
  fill(NF, .direction = "down") %>% 
  filter(!str_detect(ID, '\\/'))

      ID           NF
1   5683 001-00011598
2  10180 001-00011598
3  10208 001-00011598
4  12551 001-00011598
5  13121 001-00011598
6  13306 001-00011598
7  12551 001-00011599
8  12674 001-00011599
9  13306 001-00011599
10 13306 001-00011600

data:

structure(list(ID = c("27/07/2021", "5683", "10180", "10208", 
"12551", "13121", "13306", "27/07/2021", "12551", "12674", "13306", 
"27/07/2021", "13306"), NF = c("001-00011598", "BASE P/ NOTEBOOK NBC-50BK C3TECH", 
"SUPORTE MONITOR 17 A 32 T1224N ELG", "SUPORTE F160N DUPLO P/ MONITOR 17 A 27 ELG", 
"MINI ADAPTADOR XC-BTT-04 XCELL", "GABINETE PINK TGC-P03P T-DAGGER", 
"CASE HD   SSD CGHD-G34 EXBOM", "001-00011599", "MINI ADAPTADOR XC-BTT-04 XCELL", 
"SUPORTE FIXO GENIUS ELG", "CASE HD   SSD CGHD-G34 EXBOM", "001-00011600", 
"CASE HD   SSD CGHD-G34 EXBOM")), class = "data.frame", row.names = c(NA, 
-13L))