First, apologies if this sounds too basic. I have the following semi-structured data in text format, I need to parse these into a structured format: example:
Name
Alex
Address
14 high street
London
Color
blue
red
Name
Bob
Color
black
**Note that Alex has two colors, while Bob does not have an address. **
I want something that looks like this:
I think the right way is using regular expressions, but I'm struggling to split the text properly since some fields may be missing. What's a proper clean way to do this?
text='Name\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\nName\nBob\nColor\nblack'
profiles=re.split('(Name\n)', text, flags=re.IGNORECASE)
for profile in profiles:
#get name
name=re.split('(Name\n)|(Address\n)|(Color\n)', profile.strip(), flags=re.IGNORECASE)[0]
print(name)
#get address
#get color
CodePudding user response:
Try:
s = """\
Name
Alex
Address
14 high street
London
Color
blue
red
Name
Bob
Color
black"""
import pandas as pd
from itertools import groupby
colnames = ["Name", "Address", "Color"]
col1, col2 = [], []
for k, g in groupby(
(l for l in s.splitlines() if l.strip()), lambda l: l in colnames
):
(col2, col1)[k].append(" ".join(g))
df = pd.DataFrame({"col1": col1, "col2": col2})
df = df.assign(col3=df.col1.eq("Name").cumsum()).pivot(
index="col3", columns="col1", values="col2"
)
df.index.name, df.columns.name = None, None
df["Color"] = df["Color"].str.split()
df = df.explode("Color").fillna("")
print(df[colnames])
Prints:
Name Address Color
1 Alex 14 high street London blue
1 Alex 14 high street London red
2 Bob black
CodePudding user response:
Here's a vanilla python way to tackle the problem (rather than using pandas)
- Load your data (you'll probably read the data from your file, but we'll use this placeholder instead)
s = """
Name
Alex
Address
14 high street
London
Color
blue
red
Name
Bob
Color
black
"""
- Split up each entry by lines with
'Name'
entities = [e for e in s.split('Name')]
produces
[
'\n',
'\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\n',
'\n\nBob\nColor\nblack\n'
]
- Replace the newlines with spaces, then clean up duplicate spaces
entities = [e.replace('\n',' ').replace(' ',' ') for e in entities]
produces
[
' ',
' Alex Address 14 high street London Color blue red ',
' Bob Color black '
]
- Split each entry by space and toss any empty list entries
entities = [
[x for x in e.split(' ') if x != '']
for e in entities
]
produces
[
[],
['Alex', 'Address', '14', 'high', 'street', 'London', 'Color', 'blue', 'red'],
['Bob', 'Color', 'black']
]
- Get rid of any empty entities
entities = [e for e in entities if len(e) > 0]
produces
[
['Alex', 'Address', '14', 'high', 'street', 'London', 'Color', 'blue', 'red'],
['Bob', 'Color', 'black']
]
- Establish our tokens, then for the index of each token, capture the list elements that appear before the next token
# We'll store our findings here.
# We'll use the name, which is the first element of our 'entity' list,
# as the key for this dict.
properties = {}
for entity in entities:
# We'll figure out where each token shows up in our list
token_indices = []
for token in tokens:
# we could use
# token_indices.append(entity.index(token))
# if we knew that each token would only show up once
token_indices = [i for i,t in enumerate(entity) if t==token]
# now we'll sort the list of indices so that we can be sure
# we're dealing with them in order
token_indices = sorted(token_indices)
# since we haven't seen this person before, we'll establish a
# dict for their properties.
individual_properties = {}
for k,_ in enumerate(token_indices):
this_tkn_name = entity[token_indices[k]]
this_tkn_idx = token_indices[k]
next_tkn_idx = token_indices[k 1]
# We'll iterate over each token's index
if k 1 < len(token_indices):
# this isn't the last token
individual_properties[
this_tkn_name
] = ' '.join(entity[this_tkn_idx 1:next_tkn_idx])
else:
# this is the last token
individual_properties[
this_tkn_name
] = ' '.join(entity[this_tkn_idx 1:])
# the first element in the entity list is their name, so we can
# find that with entity[0]
properties[entity[0]] = individual_properties
produces
{
'Alex': {
'Address': '14 high street London',
'Color': 'blue red'
},
'Bob': {
'Color': 'black'
}
}
NOTE: Depending on what you want to do with this, you may need further processing. Maybe you know what the colors are a list so you could use split(' ')
to get a list instead of a single string.