I have XY datasets that have been concatenated together along with labels and are in one column. I want to split the dataset after the string labels, split the numerical data then then insert a third column of the label. So it goes from this:
df = pd.DataFrame.from_dict(
{
'XYZ': ['Monday', '120 12', '119 51', '133 85', '1414 268',
'Wednesday', '3 62', '4 27', 'Friday', '23 100', '155 300', '123 400'],
}
)
print(df)
To this format...
df2 = pd.DataFrame.from_dict(
{
'X': ['120', '119', '133', '1414', '3', '4', '23', '155', '123'],
'Y': [ '12', '51', '85', '268', '62', '27', '100', '300', '400'],
'z': [ 'Monday', 'Monday', 'Monday', 'Monday', 'Wednesday', 'Wednesday', 'Friday', 'Friday', 'Friday']
}
)
print(df2)
What would be the best way to do this?
CodePudding user response:
You can use a regex to extract your 3 values (see the regex demo), then rework the rows with ffill
and dropna
:
df2 = (df['XYZ']
# extract Z or X and Y
.str.extract(r'(^\S $)|(\d )\s (\d )')
.set_axis(['Z', 'X', 'Y'], axis=1)
# ffill Z and remove the original header
.assign(Z=lambda d: d['Z'].ffill().where(d['Z'].isna()))
# remove original header row
.dropna(subset='Z')
# alternative:
#.loc[lambda d: d['Z'].notna()]
# convert to integer (optional)
.astype({'X': int, 'Y': int})
)
output:
Z X Y
1 Monday 120 12
2 Monday 119 51
3 Monday 133 85
4 Monday 1414 268
6 Wednesday 3 62
7 Wednesday 4 27
9 Friday 23 100
10 Friday 155 300
11 Friday 123 400
CodePudding user response:
One option with pandas string methods and conditional assignment:
(df
.XYZ
.str
.split(expand=True)
.set_axis(['x', 'y'], axis = 'columns')
.assign(z = lambda f: f.x.where(f.x.str.endswith('day')).ffill())
.loc[lambda f: f.x.ne(f.z)]
)
x y z
1 120 12 Monday
2 119 51 Monday
3 133 85 Monday
4 1414 268 Monday
6 3 62 Wednesday
7 4 27 Wednesday
9 23 100 Friday
10 155 300 Friday
11 123 400 Friday
Another option is to reshape within vanilla python before recreating the dataframe; it is more verbose, but might be more performant:
from collections import defaultdict
out = df.to_dict('list')['XYZ']
data = defaultdict(list)
lengths = []
for num, entry in enumerate(out):
if entry.endswith('day'):
data['z'].append(entry)
lengths.append(num)
else:
x, y = entry.split()
x = x.strip()
y = y.strip()
data['x'].append(int(x))
data['y'].append(int(y))
lengths.append(len(out))
lengths = np.array(lengths)
lengths = lengths[1:] - lengths[:-1] - 1
data = {key : value
if key != 'z'
else np.repeat(value, lengths)
for key, value in data.items()}
pd.DataFrame(data)
z x y
0 Monday 120 12
1 Monday 119 51
2 Monday 133 85
3 Monday 1414 268
4 Wednesday 3 62
5 Wednesday 4 27
6 Friday 23 100
7 Friday 155 300
8 Friday 123 400