How to convert text into structured data, taking into account missing fields, in Python?-CodePudding

First, apologies if this sounds too basic. I have the following semi-structured data in text format, I need to parse these into a structured format: example:

Name
Alex

Address
14 high street
London

Color
blue
red

Name

Bob
Color
black

**Note that Alex has two colors, while Bob does not have an address. **

I want something that looks like this:

example output

I think the right way is using regular expressions, but I'm struggling to split the text properly since some fields may be missing. What's a proper clean way to do this?

text='Name\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\nName\nBob\nColor\nblack'

profiles=re.split('(Name\n)', text, flags=re.IGNORECASE)
for profile in profiles:
    #get name
    name=re.split('(Name\n)|(Address\n)|(Color\n)', profile.strip(), flags=re.IGNORECASE)[0]
    print(name)
    
    #get address

    #get color

CodePudding user response：

Try:

s = """\
Name
Alex

Address
14 high street
London

Color
blue
red

Name

Bob
Color
black"""


import pandas as pd
from itertools import groupby

colnames = ["Name", "Address", "Color"]


col1, col2 = [], []
for k, g in groupby(
    (l for l in s.splitlines() if l.strip()), lambda l: l in colnames
):
    (col2, col1)[k].append(" ".join(g))

df = pd.DataFrame({"col1": col1, "col2": col2})
df = df.assign(col3=df.col1.eq("Name").cumsum()).pivot(
    index="col3", columns="col1", values="col2"
)
df.index.name, df.columns.name = None, None


df["Color"] = df["Color"].str.split()
df = df.explode("Color").fillna("")

print(df[colnames])

Prints:

   Name                Address  Color
1  Alex  14 high street London   blue
1  Alex  14 high street London    red
2   Bob                         black

CodePudding user response：

Here's a vanilla python way to tackle the problem (rather than using pandas)

Load your data (you'll probably read the data from your file, but we'll use this placeholder instead)

s = """
Name
Alex

Address
14 high street
London

Color
blue
red

Name

Bob
Color
black
"""

Split up each entry by lines with 'Name'

entities = [e for e in s.split('Name')]

produces

[
 '\n',
 '\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\n',
 '\n\nBob\nColor\nblack\n'
]

Replace the newlines with spaces, then clean up duplicate spaces

entities = [e.replace('\n',' ').replace('  ',' ') for e in entities]

produces

[
 ' ',
 ' Alex Address 14 high street London Color blue red ',
 ' Bob Color black '
]

Split each entry by space and toss any empty list entries

entities = [
    [x for x in e.split(' ') if x != '']
    for e in entities
]

produces

[
 [],
 ['Alex', 'Address', '14', 'high', 'street', 'London', 'Color', 'blue', 'red'],
 ['Bob', 'Color', 'black']
]

Get rid of any empty entities

entities = [e for e in entities if len(e) > 0]

produces

[
 ['Alex', 'Address', '14', 'high', 'street', 'London', 'Color', 'blue', 'red'],
 ['Bob', 'Color', 'black']
]

Establish our tokens, then for the index of each token, capture the list elements that appear before the next token

# We'll store our findings here.  
# We'll use the name, which is the first element of our 'entity' list, 
# as the key for this dict.

properties = {}

for entity in entities:
    # We'll figure out where each token shows up in our list
    token_indices = []
    for token in tokens:
        # we could use 
        # token_indices.append(entity.index(token)) 
        # if we knew that each token would only show up once
        token_indices  = [i for i,t in enumerate(entity) if t==token]

    # now we'll sort the list of indices so that we can be sure
    # we're dealing with them in order
    token_indices = sorted(token_indices)
    
    # since we haven't seen this person before, we'll establish a 
    # dict for their properties.



    individual_properties = {}
    
    for k,_ in enumerate(token_indices):
        this_tkn_name = entity[token_indices[k]]
        this_tkn_idx = token_indices[k]
        next_tkn_idx = token_indices[k 1]

        # We'll iterate over each token's index
        if k 1 < len(token_indices):
            # this isn't the last token
            individual_properties[
                this_tkn_name
            ] = ' '.join(entity[this_tkn_idx 1:next_tkn_idx])

        else:
            # this is the last token
            individual_properties[
                this_tkn_name
            ] = ' '.join(entity[this_tkn_idx 1:])
        
    # the first element in the entity list is their name, so we can
    # find that with entity[0]
    properties[entity[0]] = individual_properties

produces

{
  'Alex': {
    'Address': '14 high street London', 
    'Color': 'blue red'
  },
  'Bob': {
    'Color': 'black'
  }
}

NOTE: Depending on what you want to do with this, you may need further processing. Maybe you know what the colors are a list so you could use split(' ') to get a list instead of a single string.