Home > Software engineering >  parse .txt file into multiple rows based on either number of characters or a specific start point
parse .txt file into multiple rows based on either number of characters or a specific start point

Time:09-15

I have multiple .txt from the USPS which need to be parsed.

The .txt file is like this:

C     COPYRIGHT(C) 09-22 USPS 001                                                                                                                                                     D91801Z221103711AFC049  HOWARD                      ST    00000005280000000528EELMHURST GUEST HOME                                         B33983398B  050108CA03727            Z20050D91801Z221103713AFC027E MAIN                        ST    00000006300000000630EBROAD SOLUTIONS                                             B40644064B  050108CA03727            Z20050D91801Z221103716AFC006E MAIN                        ST    00000010010000001001ODR FORTNER/FORTIER PROF.                                    B41994199B  050108CA03727            Z20050D91801Z221103721AFC050N    4TH                      ST    00000001280000000128EMY LADIES GUEST HOME                                        B34903490B  050108CA03727            Z20050D91801Z221103723AFC044N GARFIELD                    AVE   00000006240000000624BSCHILLINGS                                                  B14681468B  050108CA03727            Z20050D91801Z221103724AFC044N GARFIELD                    AVE   00000007120000000712BNAVARRO CONST CO                                            B14971497B  050108CA03727            Z20050D91801Z221103725AFC044N GARFIELD                    AVE   00000004120000000412EHOLY TRINITY CHURCH                                         B24982498B  050108CA03727            Z20050D91801Z221103726AFC044N GARFIELD                    AVE   00000004200000000420ERUSSELL MAIORANA DDS                                        B24972497B  050108CA03727            Z20050D91801Z221103729AFC049N OLIVE                       AVE   00000000210000000021OBETHANY CHURCH                                              B33863386B  050108CA03727            Z20050D91801Z221103731AFC061S    1ST                      ST    00000001110000000111OALHAMBRA CITY HALL                                          B37963796B  050108CA03727            Z20050D91801Z221103734AFC036S ALMANSOR                    ST    00000008400000000840EFIRST LUTHERAN CHURCH                                       B45994599B  050108CA03727            Z20050D91801Z221103735AFC023S ATLANTIC                    BLVD  00000002140000000214EATHERTON BAPTIST HOMES                                      B32983298B  050108CA03727            

I have tried reading the .txt as:

df = read.table("918.txt",  sep = "", header = F)

This method kind of works but all the records are read in one row (only one observation with many columns). How can I have a new row whenever a condition is met. The information after Z should be a new row. Each row or record contains 182 characters (spaces included).

In a sense it should read like so:

Z20050D91801Z221103725AFC044N GARFIELD                    AVE   00000004120000000412EHOLY TRINITY CHURCH                                         B24982498B  050108CA03727
Z20050D91801Z221103726AFC044N GARFIELD                    AVE   00000004200000000420ERUSSELL MAIORANA DDS                                        B24972497B  050108CA03727
Z20050D91801Z221103729AFC049N OLIVE                       AVE   00000000210000000021OBETHANY CHURCH                                              B33863386B  050108CA03727            

Also, is there a way to parse certain fields even more? For example, the first portion of each address starts with Z followed by a bunch of characters. I'd like to separate those further.

Thank you all for your help!

CodePudding user response:

If the records should all start with Z, have any number of digits and letters, and ends on another letter (which appears to be one of the cardinal directions, but I'm not assuming that), then we can do:

# quux <- readLines("path/to/file.txt")
quux2 <- strsplit(gsub("\\b([DZ][0-9][A-Z0-9] )\\b", "\n\\1", quux), "\n")[[1]]
quux2
#  [1] "C     COPYRIGHT(C) 09-22 USPS 001                                                                                                                                                     "
#  [2] "D91801Z221103711AFC049  HOWARD                      ST    00000005280000000528EELMHURST GUEST HOME                                         B33983398B  050108CA03727            "      
#  [3] "Z20050D91801Z221103713AFC027E MAIN                        ST    00000006300000000630EBROAD SOLUTIONS                                             B40644064B  050108CA03727            "
#  [4] "Z20050D91801Z221103716AFC006E MAIN                        ST    00000010010000001001ODR FORTNER/FORTIER PROF.                                    B41994199B  050108CA03727            "
#  [5] "Z20050D91801Z221103721AFC050N    4TH                      ST    00000001280000000128EMY LADIES GUEST HOME                                        B34903490B  050108CA03727            "
#  [6] "Z20050D91801Z221103723AFC044N GARFIELD                    AVE   00000006240000000624BSCHILLINGS                                                  B14681468B  050108CA03727            "
#  [7] "Z20050D91801Z221103724AFC044N GARFIELD                    AVE   00000007120000000712BNAVARRO CONST CO                                            B14971497B  050108CA03727            "
#  [8] "Z20050D91801Z221103725AFC044N GARFIELD                    AVE   00000004120000000412EHOLY TRINITY CHURCH                                         B24982498B  050108CA03727            "
#  [9] "Z20050D91801Z221103726AFC044N GARFIELD                    AVE   00000004200000000420ERUSSELL MAIORANA DDS                                        B24972497B  050108CA03727            "
# [10] "Z20050D91801Z221103729AFC049N OLIVE                       AVE   00000000210000000021OBETHANY CHURCH                                              B33863386B  050108CA03727            "
# [11] "Z20050D91801Z221103731AFC061S    1ST                      ST    00000001110000000111OALHAMBRA CITY HALL                                          B37963796B  050108CA03727            "
# [12] "Z20050D91801Z221103734AFC036S ALMANSOR                    ST    00000008400000000840EFIRST LUTHERAN CHURCH                                       B45994599B  050108CA03727            "
# [13] "Z20050D91801Z221103735AFC023S ATLANTIC                    BLVD  00000002140000000214EATHERTON BAPTIST HOMES                                      B32983298B  050108CA03727            "

Assuming we want to keep the (first) D-leading line (where its field is shorter than the Z-lines), then we can buffer it manually. Skip the padding if you don't want this, and then omit this line when reading the fixed-width table below (since this one won't align properly):

nchars <- nchar(sub(" .*", "", quux2))
nchars
#  [1]  1 22 29 29 29 29 29 29 29 29 29 29 29
quux2[2] <- paste0(strrep(" ", nchars[3] - nchars[2] - 1), quux2[2])

Now reading in the fixed-width table:

quux3 <- read.fwf(textConnection(quux2[-1]), widths=c(30, 34, 10, 11, 60, 12, 6, 2, 5), colClasses = "character")
quux3[] <- lapply(quux3, trimws)
# optional step to reduce space in addresses
quux3[] <- lapply(quux3, gsub, pattern = "\\s ", replacement = " ")
quux3
#                               V1            V2         V3          V4                       V5         V6     V7 V8    V9
# 1         D91801Z221103711AFC049     HOWARD ST 0000000528 0000000528E      ELMHURST GUEST HOME B33983398B 050108 CA 03727
# 2  Z20050D91801Z221103713AFC027E       MAIN ST 0000000630 0000000630E          BROAD SOLUTIONS B40644064B 050108 CA 03727
# 3  Z20050D91801Z221103716AFC006E       MAIN ST 0000001001 0000001001O DR FORTNER/FORTIER PROF. B41994199B 050108 CA 03727
# 4  Z20050D91801Z221103721AFC050N        4TH ST 0000000128 0000000128E     MY LADIES GUEST HOME B34903490B 050108 CA 03727
# 5  Z20050D91801Z221103723AFC044N  GARFIELD AVE 0000000624 0000000624B               SCHILLINGS B14681468B 050108 CA 03727
# 6  Z20050D91801Z221103724AFC044N  GARFIELD AVE 0000000712 0000000712B         NAVARRO CONST CO B14971497B 050108 CA 03727
# 7  Z20050D91801Z221103725AFC044N  GARFIELD AVE 0000000412 0000000412E      HOLY TRINITY CHURCH B24982498B 050108 CA 03727
# 8  Z20050D91801Z221103726AFC044N  GARFIELD AVE 0000000420 0000000420E     RUSSELL MAIORANA DDS B24972497B 050108 CA 03727
# 9  Z20050D91801Z221103729AFC049N     OLIVE AVE 0000000021 0000000021O           BETHANY CHURCH B33863386B 050108 CA 03727
# 10 Z20050D91801Z221103731AFC061S        1ST ST 0000000111 0000000111O       ALHAMBRA CITY HALL B37963796B 050108 CA 03727
# 11 Z20050D91801Z221103734AFC036S   ALMANSOR ST 0000000840 0000000840E    FIRST LUTHERAN CHURCH B45994599B 050108 CA 03727
# 12 Z20050D91801Z221103735AFC023S ATLANTIC BLVD 0000000214 0000000214E   ATHERTON BAPTIST HOMES B32983298B 050108 CA 03727

Data

quux <- "C     COPYRIGHT(C) 09-22 USPS 001                                                                                                                                                     D91801Z221103711AFC049  HOWARD                      ST    00000005280000000528EELMHURST GUEST HOME                                         B33983398B  050108CA03727            Z20050D91801Z221103713AFC027E MAIN                        ST    00000006300000000630EBROAD SOLUTIONS                                             B40644064B  050108CA03727            Z20050D91801Z221103716AFC006E MAIN                        ST    00000010010000001001ODR FORTNER/FORTIER PROF.                                    B41994199B  050108CA03727            Z20050D91801Z221103721AFC050N    4TH                      ST    00000001280000000128EMY LADIES GUEST HOME                                        B34903490B  050108CA03727            Z20050D91801Z221103723AFC044N GARFIELD                    AVE   00000006240000000624BSCHILLINGS                                                  B14681468B  050108CA03727            Z20050D91801Z221103724AFC044N GARFIELD                    AVE   00000007120000000712BNAVARRO CONST CO                                            B14971497B  050108CA03727            Z20050D91801Z221103725AFC044N GARFIELD                    AVE   00000004120000000412EHOLY TRINITY CHURCH                                         B24982498B  050108CA03727            Z20050D91801Z221103726AFC044N GARFIELD                    AVE   00000004200000000420ERUSSELL MAIORANA DDS                                        B24972497B  050108CA03727            Z20050D91801Z221103729AFC049N OLIVE                       AVE   00000000210000000021OBETHANY CHURCH                                              B33863386B  050108CA03727            Z20050D91801Z221103731AFC061S    1ST                      ST    00000001110000000111OALHAMBRA CITY HALL                                          B37963796B  050108CA03727            Z20050D91801Z221103734AFC036S ALMANSOR                    ST    00000008400000000840EFIRST LUTHERAN CHURCH                                       B45994599B  050108CA03727            Z20050D91801Z221103735AFC023S ATLANTIC                    BLVD  00000002140000000214EATHERTON BAPTIST HOMES                                      B32983298B  050108CA03727            "

CodePudding user response:

qdapRegex::ex_between(t, "Z2","CA03727", extract = T)

if the pattern its not needed just the lines, go with it !

  •  Tags:  
  • r txt
  • Related