How do I split a Pandas dataframe after a row using string conditions and then make the strings a va-CodePudding

I have XY datasets that have been concatenated together along with labels and are in one column. I want to split the dataset after the string labels, split the numerical data then then insert a third column of the label. So it goes from this:

     df = pd.DataFrame.from_dict(
    {
        'XYZ': ['Monday', '120   12', '119    51', '133   85', '1414   268',
              'Wednesday', '3   62', '4   27', 'Friday', '23   100', '155   300', '123   400'],
    }
)
print(df)

To this format...

    df2 = pd.DataFrame.from_dict(
    {
        'X': ['120', '119', '133', '1414', '3', '4', '23', '155', '123'],
        'Y': [ '12', '51', '85', '268', '62', '27', '100', '300', '400'],
        'z': [ 'Monday', 'Monday', 'Monday', 'Monday', 'Wednesday', 'Wednesday', 'Friday', 'Friday', 'Friday']
    }
)
print(df2)

What would be the best way to do this?

CodePudding user response：

You can use a regex to extract your 3 values (see the regex demo), then rework the rows with ffill and dropna:

df2 = (df['XYZ']
 # extract Z or X and Y
 .str.extract(r'(^\S $)|(\d )\s (\d )')
 .set_axis(['Z', 'X', 'Y'], axis=1)
 # ffill Z and remove the original header
 .assign(Z=lambda d: d['Z'].ffill().where(d['Z'].isna()))
 # remove original header row
 .dropna(subset='Z')
 # alternative:
 #.loc[lambda d: d['Z'].notna()]
 # convert to integer (optional)
 .astype({'X': int, 'Y': int})
)

output:

            Z     X    Y
1      Monday   120   12
2      Monday   119   51
3      Monday   133   85
4      Monday  1414  268
6   Wednesday     3   62
7   Wednesday     4   27
9      Friday    23  100
10     Friday   155  300
11     Friday   123  400

CodePudding user response：

One option with pandas string methods and conditional assignment:

(df
.XYZ
.str
.split(expand=True)
.set_axis(['x', 'y'], axis = 'columns')
.assign(z = lambda f: f.x.where(f.x.str.endswith('day')).ffill())
.loc[lambda f: f.x.ne(f.z)]
)

       x    y          z
1    120   12     Monday
2    119   51     Monday
3    133   85     Monday
4   1414  268     Monday
6      3   62  Wednesday
7      4   27  Wednesday
9     23  100     Friday
10   155  300     Friday
11   123  400     Friday

Another option is to reshape within vanilla python before recreating the dataframe; it is more verbose, but might be more performant:

from collections import defaultdict

out = df.to_dict('list')['XYZ']
data = defaultdict(list)
lengths = []

for num, entry in enumerate(out):
    if entry.endswith('day'):
        data['z'].append(entry)
        lengths.append(num)
    else:
        x, y = entry.split()
        x = x.strip()
        y = y.strip()
        data['x'].append(int(x))
        data['y'].append(int(y))

lengths.append(len(out))
lengths = np.array(lengths)
lengths = lengths[1:] - lengths[:-1] - 1

data = {key : value 
        if key != 'z' 
        else np.repeat(value, lengths)
        for key, value in data.items()}

pd.DataFrame(data)

           z     x    y
0     Monday   120   12
1     Monday   119   51
2     Monday   133   85
3     Monday  1414  268
4  Wednesday     3   62
5  Wednesday     4   27
6     Friday    23  100
7     Friday   155  300
8     Friday   123  400