Home > other >  Pandas transpose raw data with inconsistent row pattern
Pandas transpose raw data with inconsistent row pattern

Time:05-22

I have the following raw data export:

import pandas as pd
df = pd.DataFrame({ 
     'Data': ['A15','abc','A25',50,'abc','A30',20,'def','A4','hijk','A',10,'mnop'],

    })
df

enter image description here

I am trying to transpose this raw data to a table with 3 columns: Name, Number and Text

I would like a row for every time A appears as this is the consistent pattern. After that there is always a text but only sometimes there is a number. If this number appears it is always the direct row after the A. My expected output is this:

enter image description here

Any ideas on how i can approach this? Thanks very much!

CodePudding user response:

You can define masks and use a pivot:

m1 = df['Data'].str.startswith('A')
m2 = m1.isna() #OR: pd.to_numeric(df['Data'], errors='coerce').notna()
m1 = m1.fillna(False)

df2 = (df
 .assign(index=m1.cumsum(),
         col=np.select([m1, m2], ['Name', 'Number'], 'Text')
        )
 .pivot(index='index', columns='col', values='Data')
)

output:

col   Name Number  Text
index                  
1      A15    NaN   abc
2      A25     50   abc
3      A30     20   def
4       A4    NaN  hijk
5        A     10  mnop

intermediates:

    Data     m1  m1(v2)     m2  m1(cumsum)
0    A15   True    True  False           1
1    abc  False   False  False           1
2    A25   True    True  False           2
3     50    NaN   False   True           2
4    abc  False   False  False           2
5    A30   True    True  False           3
6     20    NaN   False   True           3
7    def  False   False  False           3
8     A4   True    True  False           4
9   hijk  False   False  False           4
10     A   True    True  False           5
11    10    NaN   False   True           5
12  mnop  False   False  False           5

CodePudding user response:

import re
import pandas as pd
import numpy as np

df = pd.DataFrame({'Data': ['A15', 'abc', 'A25', 50, 'abc', 'A30', 20, 'def', 'A4', 'hijk', 'A', 10, 'mnop']})

convert to list:

flat_list = df['Data'].tolist()

iterate over list, if element matches A\d add new sublist otherwise append to last sublist:

nested_list = []
while len(flat_list) > 0:
    element = flat_list.pop(0)
    if re.fullmatch("A\d*", str(element)):
        nested_list.append([element])
    else:
        nested_list[-1].append(element)

to list of dicts where 'Number' is np.NaN if a sublist has only two items:

as_records = [
    {'Name': l[0], 'Number': l[1], 'text': l[2]} if len(l) == 3 else {'Name': l[0], 'Number': np.NaN, 'text': l[1]} 
    for l in nested_list]

convert to DataFrame:

df_out = pd.DataFrame(as_records)

Which returns:

  Name  Number  text
0  A15     NaN   abc
1  A25    50.0   abc
2  A30    20.0   def
3   A4     NaN  hijk
4    A    10.0  mnop
  • Related