I have a data that I extracted from a pdf to text, and it is in this format :
text =
1
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
2
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
3
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
....
The number 1,2,3 are supposed to be indexes. Now I want to write a loop that put in a data frame for each index, the information that follows it. So the result would be a table like :
index | Address | ZipCode | PhoneNumber | IdNumber | Date |
---|
I started writing the code but im stuck at the regex part.. How to put a variable to iterate in the look ahead or after part ? Any solution ?
import re
import pandas as pd
indexes = re.findall(r'(?<=\n)\d{1}(?=\n)', text)
# convert string to integer
for i in range(0, len(indexes)):
indexes[i] = int(indexes[i])
# extract the data
text
indexes
data = {}
for index in indexes:
next_index = index 1
index_value = re.search(r'(?<={index}).*(?={next_index})', text).group()
data[index] = index_value
Thanks!
CodePudding user response:
The slippery slope of regex: if you think you are going to solve your problem with regex, you may soon have 2 problems.
You data is much easier to parse. Each record occupies 7 lines and you already know which line contains which piece of the data:
first_index = None
data = []
current = {}
for i, line in enumerate(text.split("\n")):
if first_index is None:
first_index = i if re.match("\d ", line) else None
if first_index is None or line == "":
continue
delta = (i - first_index) % 7
if delta == 0:
current["index"] = int(line)
elif delta == 1:
current["Address"] = line
elif delta == 2:
current["Address"] = f" {line}"
elif delta == 3:
current["ZipCode"] = line
elif delta == 4:
current["PhoneNumber"] = line
elif delta == 5:
current["IdNumber"] = line
elif delta == 6:
current["Date"] = line
data.append(current)
current = {}
df = pd.DataFrame(data)
A more panda-y solution, once again relying on the 7-row-per-record structure of your text:
from io import StringIO
col_names = ["index", "Address1", "Address2", "ZipCode", "PhoneNumber", "IdNumber", "Date"]
df = (
pd.read_csv(StringIO(text), header=None)
# pd.read_csv("data.txt", header=None) # alternative: read it directly from the file
.assign(a=lambda x: x.index // 7, b=lambda x: x.index % 7)
.set_index(["a", "b"])
.unstack()
.set_axis(col_names, axis=1)
.rename_axis(None)
)
a
is the record number (similar idea to index
), b
is the position of the line within the record.
The downside of this approach is that all columns are strings and you must manually convert them to the appropriate datatype.
CodePudding user response:
Combine with string formatting:
rf'(?<={index}).*(?={next_index})'
>>> import re
>>> index=4
>>> next_index=5
>>> re.search(rf'(?<={index}).*(?={next_index})', '...4something5...')
<re.Match object; span=(4, 13), match='something'>
If you end up having to use the braces in your pattern, double them up:
>>> a="a"
>>> re.search(rf'{a}{{3}}', 'aaaa')
<re.Match object; span=(0, 3), match='aaa'>
CodePudding user response:
Use:
s = pd.read_csv('file.txt', header=None)
is_digit = s[0].str.isdigit()
index = (is_digit & (~is_digit.shift(-1, fill_value=False))).cumsum()
columns = index.groupby(index).cumcount()
new_df = \
s.pivot_table(index=index,
columns=columns,
aggfunc='first', values=0)\
.set_axis(['index', 'Address 1', 'Address 2',
'ZipeCode', 'Phone Number',
'ID Number', 'Date'], axis=1)
new_df = new_df.assign(Address=new_df['Address 1'].str.cat(new_df['Address 2'], ' '))\
.drop(['Address 1', 'Address 2'], axis=1)
print(new_df)
index ZipeCode Phone Number ID Number Date \
0
1 1 Zipe Code Phone number ID number Date
2 2 Zipe Code Phone number ID number Date
3 3 Zipe Code Phone number ID number Date
Address
0
1 Address line 1 Address line 2
2 Address line 1 Address line 2
3 Address line 1 Address line 2