I have a text file with this structure:
- ASDF : |a=1|b=1|c=1|d=1
- QWER : |b=2|e=2|f=2
- ZXCV : |a=3|c=3|e=3|f=3|g=3
- TREW : |a=4|b=4|g=4
and I'd like to create a dataframe like this:
index | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
ASDF | 1 | 1 | 1 | 1 | NaN | NaN | NaN |
QWER | NaN | 2 | NaN | NaN | NaN | 2 | NaN |
ZXCV | 3 | NaN | 3 | NaN | 3 | 3 | 3 |
TREW | 4 | 4 | NaN | NaN | NaN | NaN | 4 |
which solution could I implement? consider that I don't know in advance how many rows I have, nor the number or name of the tags. thank you all in advance!
CodePudding user response:
Use read_csv
first, then convert first column to index
and split by |
by Series.str.split
, reshape by DataFrame.stack
, split by =
, convert index to columns and pivoting by DataFrame.pivot
, last create uppercase columns names:
df = pd.read_csv('file', sep=':', header=None)
df1 = (df.set_index(0)[1]
.str.split('|', expand=True)
.stack()
.str.split('=', expand=True)
.dropna()
.rename_axis(['a','b'])
.reset_index()
.pivot('a',0,1)
.rename(columns=str.upper, index = lambda x: x.strip(' -'))
.rename_axis(index=None, columns=None))
print (df1)
A B C D E F G
ASDF 1 1 1 1 NaN NaN NaN
QWER NaN 2 NaN NaN 2 2 NaN
TREW 4 4 NaN NaN NaN NaN 4
ZXCV 3 NaN 3 NaN 3 3 3
CodePudding user response:
I would harness regular expression for that following way
import re
import pandas as pd
text = '''ASDF : |a=1|b=1|c=1|d=1
QWER : |b=2|e=2|f=2
ZXCV : |a=3|c=3|e=3|f=3|g=3
TREW : |a=4|b=4|g=4'''
def extract_data(line):
name = line.split()[0] # name is everything before first whitespace
data = dict(re.findall(r'([a-z] )=(\d )',line))
return {"name":name,**data}
df = pd.DataFrame(map(extract_data,text.splitlines()))
df.set_index("name",inplace=True)
print(df)
output
a b c d e f g
name
ASDF 1 1 1 1 NaN NaN NaN
QWER NaN 2 NaN NaN 2 2 NaN
ZXCV 3 NaN 3 NaN 3 3 3
TREW 4 4 NaN NaN NaN NaN 4
Explanation: I use regular expression with capturing group to find pairs of one-or-more lowercase ASCII characters [a-z]
and one-or-more digits \d
sheared by requal sign =
, then conver that into dict
and do build dict from that and name
. Disclaimer: this solution assumes every name appears no more than once in each line.
CodePudding user response:
You can use read_csv
to load the data, then replace
to fix the index name (removing the -
). str.extractall
to get the key=value
pairs, and reshape with unstack
:
# using io.StringIO for the example
# in real-life use your file as input
import io
data = '''- ASDF : |a=1|b=1|c=1|d=1
- QWER : |b=2|e=2|f=2
- ZXCV : |a=3|c=3|e=3|f=3|g=3
- TREW : |a=4|b=4|g=4'''
# in real-life use:
# (pd.read_csv('filename.txt', sep=' : ', header=None,
df = (pd.read_csv(io.StringIO(data), sep=' : ', header=None,
names=['index', 'val'], engine='python')
# cleanup index column
.assign(index=lambda d: d['index'].replace('^-\s*', '', regex=True))
# extract the key=val pairs and reshape
# then merge to original
.pipe(lambda d: d.join(d.pop('val')
.str.extractall('(\w )=(\d )')
.droplevel('match')
.set_index(0, append=True)
[1].unstack(0)
)
)
)
Alternative with a python dictionary comprehension (if your format is strictly what you showed):
with open('filename.txt') as f:
df = (pd.DataFrame
.from_dict({k[2:]: dict(x.split('=') for x in v[1:].split('|'))
for l in f if l.strip()
for k,v in [l.strip().split(' : ', 1)]},
orient='index')
.reset_index()
)
output:
index a b c d e f g
0 ASDF 1 1 1 1 NaN NaN NaN
1 QWER NaN 2 NaN NaN 2 2 NaN
2 ZXCV 3 NaN 3 NaN 3 3 3
3 TREW 4 4 NaN NaN NaN NaN 4