How to use a variable in look ahead/after regex in python?-CodePudding

I have a data that I extracted from a pdf to text, and it is in this format :

text =
1
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
2
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
3
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
....

The number 1,2,3 are supposed to be indexes. Now I want to write a loop that put in a data frame for each index, the information that follows it. So the result would be a table like :

index	Address	ZipCode	PhoneNumber	IdNumber	Date

I started writing the code but im stuck at the regex part.. How to put a variable to iterate in the look ahead or after part ? Any solution ?

import re
import pandas as pd

indexes = re.findall(r'(?<=\n)\d{1}(?=\n)', text)
# convert string to integer
for i in range(0, len(indexes)):
    indexes[i] = int(indexes[i])

# extract the data
text
indexes
data = {}
for index in indexes: 
    next_index = index 1 
    index_value = re.search(r'(?<={index}).*(?={next_index})', text).group()
    data[index] = index_value

Thanks!

CodePudding user response：

The slippery slope of regex: if you think you are going to solve your problem with regex, you may soon have 2 problems.

You data is much easier to parse. Each record occupies 7 lines and you already know which line contains which piece of the data:

first_index = None
data = []
current = {}

for i, line in enumerate(text.split("\n")):
    if first_index is None:
        first_index = i if re.match("\d ", line) else None

    if first_index is None or line == "":
        continue

    delta = (i - first_index) % 7
    if delta == 0:
        current["index"] = int(line)
    elif delta == 1:
        current["Address"] = line
    elif delta == 2:
        current["Address"]  = f" {line}"
    elif delta == 3:
        current["ZipCode"] = line
    elif delta == 4:
        current["PhoneNumber"] = line
    elif delta == 5:
        current["IdNumber"] = line
    elif delta == 6:
        current["Date"] = line
        data.append(current)
        current = {}

df = pd.DataFrame(data)

A more panda-y solution, once again relying on the 7-row-per-record structure of your text:

from io import StringIO

col_names = ["index", "Address1", "Address2", "ZipCode", "PhoneNumber", "IdNumber", "Date"]
df = (
    pd.read_csv(StringIO(text), header=None)
    # pd.read_csv("data.txt", header=None)   # alternative: read it directly from the file
    .assign(a=lambda x: x.index // 7, b=lambda x: x.index % 7)
    .set_index(["a", "b"])
    .unstack()
    .set_axis(col_names, axis=1)
    .rename_axis(None)
)

a is the record number (similar idea to index), b is the position of the line within the record.

The downside of this approach is that all columns are strings and you must manually convert them to the appropriate datatype.

CodePudding user response：

Combine with string formatting:

rf'(?<={index}).*(?={next_index})'

>>> import re
>>> index=4
>>> next_index=5
>>> re.search(rf'(?<={index}).*(?={next_index})', '...4something5...')
<re.Match object; span=(4, 13), match='something'>

If you end up having to use the braces in your pattern, double them up:

>>> a="a"
>>> re.search(rf'{a}{{3}}', 'aaaa')
<re.Match object; span=(0, 3), match='aaa'>

CodePudding user response：

Use:

s = pd.read_csv('file.txt', header=None)
is_digit = s[0].str.isdigit()
index = (is_digit & (~is_digit.shift(-1, fill_value=False))).cumsum()
columns = index.groupby(index).cumcount()

new_df = \
s.pivot_table(index=index, 
              columns=columns, 
              aggfunc='first', values=0)\
.set_axis(['index', 'Address 1', 'Address 2',
           'ZipeCode', 'Phone Number', 
           'ID Number', 'Date'], axis=1)

new_df = new_df.assign(Address=new_df['Address 1'].str.cat(new_df['Address 2'], ' '))\
    .drop(['Address 1', 'Address 2'], axis=1)
print(new_df)

  index   ZipeCode  Phone Number  ID Number  Date  \
0                                                   
1     1  Zipe Code  Phone number  ID number  Date   
2     2  Zipe Code  Phone number  ID number  Date   
3     3  Zipe Code  Phone number  ID number  Date  

                         Address  
0                                 
1  Address line 1 Address line 2  
2  Address line 1 Address line 2  
3  Address line 1 Address line 2