converting a text file with tags in a pandas dataframe with tags as fields-CodePudding

I have a text file with this structure:

- ASDF : |a=1|b=1|c=1|d=1
- QWER : |b=2|e=2|f=2
- ZXCV : |a=3|c=3|e=3|f=3|g=3
- TREW : |a=4|b=4|g=4

and I'd like to create a dataframe like this:

index	A	B	C	D	E	F	G
ASDF	1	1	1	1	NaN	NaN	NaN
QWER	NaN	2	NaN	NaN	NaN	2	NaN
ZXCV	3	NaN	3	NaN	3	3	3
TREW	4	4	NaN	NaN	NaN	NaN	4

which solution could I implement? consider that I don't know in advance how many rows I have, nor the number or name of the tags. thank you all in advance!

CodePudding user response：

Use read_csv first, then convert first column to index and split by | by Series.str.split, reshape by DataFrame.stack, split by =, convert index to columns and pivoting by DataFrame.pivot, last create uppercase columns names:

df = pd.read_csv('file', sep=':', header=None)


df1 = (df.set_index(0)[1]
         .str.split('|', expand=True)
         .stack()
         .str.split('=', expand=True)
         .dropna()
         .rename_axis(['a','b'])
         .reset_index()
         .pivot('a',0,1)
         .rename(columns=str.upper, index = lambda x: x.strip(' -'))
         .rename_axis(index=None, columns=None))
         
print (df1)
        A    B    C    D    E    F    G
ASDF    1    1    1    1  NaN  NaN  NaN
QWER  NaN    2  NaN  NaN    2    2  NaN
TREW    4    4  NaN  NaN  NaN  NaN    4
ZXCV    3  NaN    3  NaN    3    3    3

CodePudding user response：

I would harness regular expression for that following way

import re
import pandas as pd
text = '''ASDF : |a=1|b=1|c=1|d=1
QWER : |b=2|e=2|f=2
ZXCV : |a=3|c=3|e=3|f=3|g=3
TREW : |a=4|b=4|g=4'''
def extract_data(line):
    name = line.split()[0]  # name is everything before first whitespace
    data = dict(re.findall(r'([a-z] )=(\d )',line))
    return {"name":name,**data}
df = pd.DataFrame(map(extract_data,text.splitlines()))
df.set_index("name",inplace=True)
print(df)

output

        a    b    c    d    e    f    g
name
ASDF    1    1    1    1  NaN  NaN  NaN
QWER  NaN    2  NaN  NaN    2    2  NaN
ZXCV    3  NaN    3  NaN    3    3    3
TREW    4    4  NaN  NaN  NaN  NaN    4

Explanation: I use regular expression with capturing group to find pairs of one-or-more lowercase ASCII characters [a-z] and one-or-more digits \d sheared by requal sign =, then conver that into dict and do build dict from that and name. Disclaimer: this solution assumes every name appears no more than once in each line.

CodePudding user response：

You can use read_csv to load the data, then replace to fix the index name (removing the - ). str.extractall to get the key=value pairs, and reshape with unstack:

# using io.StringIO for the example
# in real-life use your file as input
import io
data = '''- ASDF : |a=1|b=1|c=1|d=1
- QWER : |b=2|e=2|f=2
- ZXCV : |a=3|c=3|e=3|f=3|g=3
- TREW : |a=4|b=4|g=4'''

# in real-life use:
#    (pd.read_csv('filename.txt',    sep=' : ', header=None,
df = (pd.read_csv(io.StringIO(data), sep=' : ', header=None,
                  names=['index', 'val'], engine='python')
        # cleanup index column
        .assign(index=lambda d: d['index'].replace('^-\s*', '', regex=True))
        # extract the key=val pairs and reshape
        # then merge to original
        .pipe(lambda d: d.join(d.pop('val')
                                .str.extractall('(\w )=(\d )')
                                .droplevel('match')
                                .set_index(0, append=True)
                                [1].unstack(0)
                              )
             )
     )

Alternative with a python dictionary comprehension (if your format is strictly what you showed):

with open('filename.txt') as f:
    df = (pd.DataFrame
            .from_dict({k[2:]: dict(x.split('=') for x in v[1:].split('|'))
                        for l in f if l.strip()
                        for k,v in [l.strip().split(' : ', 1)]},
                       orient='index')
            .reset_index()
          )

output:

  index    a    b    c    d    e    f    g
0  ASDF    1    1    1    1  NaN  NaN  NaN
1  QWER  NaN    2  NaN  NaN    2    2  NaN
2  ZXCV    3  NaN    3  NaN    3    3    3
3  TREW    4    4  NaN  NaN  NaN  NaN    4