I have multiple .txt from the USPS which need to be parsed.
The .txt file is like this:
C COPYRIGHT(C) 09-22 USPS 001 D91801Z221103711AFC049 HOWARD ST 00000005280000000528EELMHURST GUEST HOME B33983398B 050108CA03727 Z20050D91801Z221103713AFC027E MAIN ST 00000006300000000630EBROAD SOLUTIONS B40644064B 050108CA03727 Z20050D91801Z221103716AFC006E MAIN ST 00000010010000001001ODR FORTNER/FORTIER PROF. B41994199B 050108CA03727 Z20050D91801Z221103721AFC050N 4TH ST 00000001280000000128EMY LADIES GUEST HOME B34903490B 050108CA03727 Z20050D91801Z221103723AFC044N GARFIELD AVE 00000006240000000624BSCHILLINGS B14681468B 050108CA03727 Z20050D91801Z221103724AFC044N GARFIELD AVE 00000007120000000712BNAVARRO CONST CO B14971497B 050108CA03727 Z20050D91801Z221103725AFC044N GARFIELD AVE 00000004120000000412EHOLY TRINITY CHURCH B24982498B 050108CA03727 Z20050D91801Z221103726AFC044N GARFIELD AVE 00000004200000000420ERUSSELL MAIORANA DDS B24972497B 050108CA03727 Z20050D91801Z221103729AFC049N OLIVE AVE 00000000210000000021OBETHANY CHURCH B33863386B 050108CA03727 Z20050D91801Z221103731AFC061S 1ST ST 00000001110000000111OALHAMBRA CITY HALL B37963796B 050108CA03727 Z20050D91801Z221103734AFC036S ALMANSOR ST 00000008400000000840EFIRST LUTHERAN CHURCH B45994599B 050108CA03727 Z20050D91801Z221103735AFC023S ATLANTIC BLVD 00000002140000000214EATHERTON BAPTIST HOMES B32983298B 050108CA03727
I have tried reading the .txt as:
df = read.table("918.txt", sep = "", header = F)
This method kind of works but all the records are read in one row (only one observation with many columns). How can I have a new row whenever a condition is met. The information after Z should be a new row. Each row or record contains 182 characters (spaces included).
In a sense it should read like so:
Z20050D91801Z221103725AFC044N GARFIELD AVE 00000004120000000412EHOLY TRINITY CHURCH B24982498B 050108CA03727
Z20050D91801Z221103726AFC044N GARFIELD AVE 00000004200000000420ERUSSELL MAIORANA DDS B24972497B 050108CA03727
Z20050D91801Z221103729AFC049N OLIVE AVE 00000000210000000021OBETHANY CHURCH B33863386B 050108CA03727
Also, is there a way to parse certain fields even more? For example, the first portion of each address starts with Z followed by a bunch of characters. I'd like to separate those further.
Thank you all for your help!
CodePudding user response:
If the records should all start with Z
, have any number of digits and letters, and ends on another letter (which appears to be one of the cardinal directions, but I'm not assuming that), then we can do:
# quux <- readLines("path/to/file.txt")
quux2 <- strsplit(gsub("\\b([DZ][0-9][A-Z0-9] )\\b", "\n\\1", quux), "\n")[[1]]
quux2
# [1] "C COPYRIGHT(C) 09-22 USPS 001 "
# [2] "D91801Z221103711AFC049 HOWARD ST 00000005280000000528EELMHURST GUEST HOME B33983398B 050108CA03727 "
# [3] "Z20050D91801Z221103713AFC027E MAIN ST 00000006300000000630EBROAD SOLUTIONS B40644064B 050108CA03727 "
# [4] "Z20050D91801Z221103716AFC006E MAIN ST 00000010010000001001ODR FORTNER/FORTIER PROF. B41994199B 050108CA03727 "
# [5] "Z20050D91801Z221103721AFC050N 4TH ST 00000001280000000128EMY LADIES GUEST HOME B34903490B 050108CA03727 "
# [6] "Z20050D91801Z221103723AFC044N GARFIELD AVE 00000006240000000624BSCHILLINGS B14681468B 050108CA03727 "
# [7] "Z20050D91801Z221103724AFC044N GARFIELD AVE 00000007120000000712BNAVARRO CONST CO B14971497B 050108CA03727 "
# [8] "Z20050D91801Z221103725AFC044N GARFIELD AVE 00000004120000000412EHOLY TRINITY CHURCH B24982498B 050108CA03727 "
# [9] "Z20050D91801Z221103726AFC044N GARFIELD AVE 00000004200000000420ERUSSELL MAIORANA DDS B24972497B 050108CA03727 "
# [10] "Z20050D91801Z221103729AFC049N OLIVE AVE 00000000210000000021OBETHANY CHURCH B33863386B 050108CA03727 "
# [11] "Z20050D91801Z221103731AFC061S 1ST ST 00000001110000000111OALHAMBRA CITY HALL B37963796B 050108CA03727 "
# [12] "Z20050D91801Z221103734AFC036S ALMANSOR ST 00000008400000000840EFIRST LUTHERAN CHURCH B45994599B 050108CA03727 "
# [13] "Z20050D91801Z221103735AFC023S ATLANTIC BLVD 00000002140000000214EATHERTON BAPTIST HOMES B32983298B 050108CA03727 "
Assuming we want to keep the (first) D
-leading line (where its field is shorter than the Z
-lines), then we can buffer it manually. Skip the padding if you don't want this, and then omit this line when reading the fixed-width table below (since this one won't align properly):
nchars <- nchar(sub(" .*", "", quux2))
nchars
# [1] 1 22 29 29 29 29 29 29 29 29 29 29 29
quux2[2] <- paste0(strrep(" ", nchars[3] - nchars[2] - 1), quux2[2])
Now reading in the fixed-width table:
quux3 <- read.fwf(textConnection(quux2[-1]), widths=c(30, 34, 10, 11, 60, 12, 6, 2, 5), colClasses = "character")
quux3[] <- lapply(quux3, trimws)
# optional step to reduce space in addresses
quux3[] <- lapply(quux3, gsub, pattern = "\\s ", replacement = " ")
quux3
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 D91801Z221103711AFC049 HOWARD ST 0000000528 0000000528E ELMHURST GUEST HOME B33983398B 050108 CA 03727
# 2 Z20050D91801Z221103713AFC027E MAIN ST 0000000630 0000000630E BROAD SOLUTIONS B40644064B 050108 CA 03727
# 3 Z20050D91801Z221103716AFC006E MAIN ST 0000001001 0000001001O DR FORTNER/FORTIER PROF. B41994199B 050108 CA 03727
# 4 Z20050D91801Z221103721AFC050N 4TH ST 0000000128 0000000128E MY LADIES GUEST HOME B34903490B 050108 CA 03727
# 5 Z20050D91801Z221103723AFC044N GARFIELD AVE 0000000624 0000000624B SCHILLINGS B14681468B 050108 CA 03727
# 6 Z20050D91801Z221103724AFC044N GARFIELD AVE 0000000712 0000000712B NAVARRO CONST CO B14971497B 050108 CA 03727
# 7 Z20050D91801Z221103725AFC044N GARFIELD AVE 0000000412 0000000412E HOLY TRINITY CHURCH B24982498B 050108 CA 03727
# 8 Z20050D91801Z221103726AFC044N GARFIELD AVE 0000000420 0000000420E RUSSELL MAIORANA DDS B24972497B 050108 CA 03727
# 9 Z20050D91801Z221103729AFC049N OLIVE AVE 0000000021 0000000021O BETHANY CHURCH B33863386B 050108 CA 03727
# 10 Z20050D91801Z221103731AFC061S 1ST ST 0000000111 0000000111O ALHAMBRA CITY HALL B37963796B 050108 CA 03727
# 11 Z20050D91801Z221103734AFC036S ALMANSOR ST 0000000840 0000000840E FIRST LUTHERAN CHURCH B45994599B 050108 CA 03727
# 12 Z20050D91801Z221103735AFC023S ATLANTIC BLVD 0000000214 0000000214E ATHERTON BAPTIST HOMES B32983298B 050108 CA 03727
Data
quux <- "C COPYRIGHT(C) 09-22 USPS 001 D91801Z221103711AFC049 HOWARD ST 00000005280000000528EELMHURST GUEST HOME B33983398B 050108CA03727 Z20050D91801Z221103713AFC027E MAIN ST 00000006300000000630EBROAD SOLUTIONS B40644064B 050108CA03727 Z20050D91801Z221103716AFC006E MAIN ST 00000010010000001001ODR FORTNER/FORTIER PROF. B41994199B 050108CA03727 Z20050D91801Z221103721AFC050N 4TH ST 00000001280000000128EMY LADIES GUEST HOME B34903490B 050108CA03727 Z20050D91801Z221103723AFC044N GARFIELD AVE 00000006240000000624BSCHILLINGS B14681468B 050108CA03727 Z20050D91801Z221103724AFC044N GARFIELD AVE 00000007120000000712BNAVARRO CONST CO B14971497B 050108CA03727 Z20050D91801Z221103725AFC044N GARFIELD AVE 00000004120000000412EHOLY TRINITY CHURCH B24982498B 050108CA03727 Z20050D91801Z221103726AFC044N GARFIELD AVE 00000004200000000420ERUSSELL MAIORANA DDS B24972497B 050108CA03727 Z20050D91801Z221103729AFC049N OLIVE AVE 00000000210000000021OBETHANY CHURCH B33863386B 050108CA03727 Z20050D91801Z221103731AFC061S 1ST ST 00000001110000000111OALHAMBRA CITY HALL B37963796B 050108CA03727 Z20050D91801Z221103734AFC036S ALMANSOR ST 00000008400000000840EFIRST LUTHERAN CHURCH B45994599B 050108CA03727 Z20050D91801Z221103735AFC023S ATLANTIC BLVD 00000002140000000214EATHERTON BAPTIST HOMES B32983298B 050108CA03727 "
CodePudding user response:
qdapRegex::ex_between(t, "Z2","CA03727", extract = T)
if the pattern its not needed just the lines, go with it !