I have a somewhat involved transformation of my data, and I was wondering if someone has a more efficient method than mine. I start with a dataframe like this one:
| | item | value |
|---:|:------------------|--------:|
| 0 | WAUZZZF23MN053792 | 0 |
| 1 | A | 1 |
| 2 | WF0TK3SS2MMA50940 | 0 |
| 3 | A | 10 |
| 4 | B | 11 |
| 5 | C | 12 |
| 6 | D | 13 |
| 7 | E | 14 |
| 8 | W0VEAZKXZMJ857138 | 0 |
| 9 | A | 20 |
| 10 | B | 21 |
| 11 | C | 22 |
| 12 | D | 23 |
| 13 | E | 24 |
| 14 | W0VEAZKXZMJ837930 | 0 |
| 15 | A | 30 |
| 16 | B | 31 |
| 17 | C | 32 |
| 18 | D | 33 |
| 19 | E | 34 |
and I would like to arrive here:
| | item | value | C |
|---:|:------------------|--------:|----:|
| 0 | WAUZZZF23MN053792 | 0 | nan |
| 1 | WF0TK3SS2MMA50940 | 0 | 12 |
| 2 | W0VEAZKXZMJ857138 | 0 | 22 |
| 3 | W0VEAZKXZMJ837930 | 0 | 32 |
I.e. for every "long" entry (a 17-character VIN), check whether an item "C" follows it, and if so, copy that line's value onto the line with the long item.
The ugly way I have done this is the following:
import re
import pandas as pd

df = pd.DataFrame(
    {
        "item": [
            "WAUZZZF23MN053792", "A",
            "WF0TK3SS2MMA50940", *"ABCDE",
            "W0VEAZKXZMJ857138", *"ABCDE",
            "W0VEAZKXZMJ837930", *"ABCDE",
        ],
        "value": [0, 1, 0, *range(10, 15), 0, *range(20, 25), 0, *range(30, 35)],
    }
)
def isVIN(x):
    # VIN heuristic: 17 characters, all uppercase, no whitespace and no O/I
    return len(x) == 17 and x.upper() == x and re.search(r"\s|O|I", x) is None
# filter the lines with item=="C" or a VIN in item, tagged with a group id
x = pd.concat([df, df["item"].rename("group").apply(isVIN).cumsum()], axis=1).loc[
    lambda d: (d["item"] == "C") | (d["item"].apply(isVIN))
]
# pivot the lines where item=="C"
y = x.loc[x["item"] == "C"].pivot(columns="item").droplevel(level=1, axis=1)
# and then merge the two:
print(
    x.loc[x["item"].apply(isVIN)]
    .merge(y, on="group", how="left")
    .drop("group", axis=1)
    .rename(columns={"value_y": "C", "value_x": "value"})
    .to_markdown()
)
Does anyone have an idea how to make this a bit less ugly?
CodePudding user response:
Subjectively less ugly
mask = df.item.str.len().eq(17)
# index by (forward-filled VIN, sub-item), drop the VIN rows, then unstack
df.set_index(
    [df.item.where(mask).ffill(), 'item']
)[~mask.to_numpy()].value.unstack()['C'].reset_index()
item C
0 W0VEAZKXZMJ837930 32.0
1 W0VEAZKXZMJ857138 22.0
2 WAUZZZF23MN053792 NaN
3 WF0TK3SS2MMA50940 12.0
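The pivotal step is df.item.where(mask).ffill(): where() blanks out the non-VIN rows and ffill() carries each VIN down over its block, so every row knows its group key. A minimal sketch of just that step:
import pandas as pd

s = pd.Series(["WF0TK3SS2MMA50940", "A", "B", "W0VEAZKXZMJ857138", "A"])
mask = s.str.len().eq(17)           # True only on the VIN rows
print(s.where(mask).ffill())        # each VIN propagated down its block
# 0    WF0TK3SS2MMA50940
# 1    WF0TK3SS2MMA50940
# 2    WF0TK3SS2MMA50940
# 3    W0VEAZKXZMJ857138
# 4    W0VEAZKXZMJ857138
# dtype: object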
A bit more involved but better
import numpy as np

mask = df.item.str.len().eq(17)
item = df.item.where(mask).ffill()  # group key: each VIN carried down its block
subs = df.item.mask(mask)           # sub-items only; VIN rows become NaN
valu = df.value
# integer codes for the target rows (VINs) and columns (sub-items)
i, r = pd.factorize(item)
j, c = pd.factorize(subs)
# scatter the values into a dense (n_vins, n_subitems) array
a = np.zeros((len(r), len(c)), valu.dtype)
a[i, j] = valu
pd.DataFrame(a, r, c)[['C']].rename_axis('item').reset_index()
item C
0 WAUZZZF23MN053792 0
1 WF0TK3SS2MMA50940 12
2 W0VEAZKXZMJ857138 22
3 W0VEAZKXZMJ837930 32
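For reference, pd.factorize is what makes the scatter assignment a[i, j] = valu possible: it encodes each label as an integer position and returns the array of uniques alongside (missing labels get code -1). A quick illustration:
import pandas as pd

codes, uniques = pd.factorize(['A', 'B', 'A', 'C'])
print(codes)    # [0 1 0 2]
print(uniques)  # ['A' 'B' 'C']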
CodePudding user response:
Try:
# Your conditions vectorized
m = ((df['item'].str.len() == 17)
& (df['item'].str.upper() == df['item'])
& (~df['item'].str.contains(r'\s|O|I')))
# Create virtual groups to align rows
df['grp'] = m.cumsum()
# Merge and align rows
out = (pd.concat([df[m].set_index('grp'),
df[~m].pivot('grp', 'item', 'value')], axis=1)
.reset_index(drop=True))
Output:
>>> out
item value A B C D E
0 WAUZZZF23MN053792 0 1.0 NaN NaN NaN NaN
1 WF0TK3SS2MMA50940 0 10.0 11.0 12.0 13.0 14.0
2 W0VEAZKXZMJ857138 0 20.0 21.0 22.0 23.0 24.0
3 W0VEAZKXZMJ837930 0 30.0 31.0 32.0 33.0 34.0
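Since the question only asks for C, a final column selection on out should give exactly the target frame:
>>> out[['item', 'value', 'C']]
                item  value     C
0  WAUZZZF23MN053792      0   NaN
1  WF0TK3SS2MMA50940      0  12.0
2  W0VEAZKXZMJ857138      0  22.0
3  W0VEAZKXZMJ837930      0  32.0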
CodePudding user response:
How about this with datar, a pandas wrapper that reimagines pandas APIs:
Construct data
>>> import re
>>> from datar.all import (
...     c, f, LETTERS, tibble, first, cumsum,
...     mutate, group_by, slice, pivot_wider, select
... )
>>>
>>> df = tibble(
...     item=c(
...         "WAUZZZF23MN053792",
...         "A",
...         "WF0TK3SS2MMA50940",
...         LETTERS[:5],
...         "W0VEAZKXZMJ857138",
...         LETTERS[:5],
...         "W0VEAZKXZMJ837930",
...         LETTERS[:5],
...     ),
...     value=c(
...         0, 1,
...         0, f[10:15],
...         0, f[20:25],
...         0, f[30:35],
...     )
... )
>>> df
item value
<object> <int64>
0 WAUZZZF23MN053792 0
1 A 1
2 WF0TK3SS2MMA50940 0
3 A 10
4 B 11
5 C 12
6 D 13
7 E 14
8 W0VEAZKXZMJ857138 0
9 A 20
10 B 21
11 C 22
12 D 23
13 E 24
14 W0VEAZKXZMJ837930 0
15 A 30
16 B 31
17 C 32
18 D 33
19 E 34
Manipulate data
>>> def isVIN(x):
...     return len(x) == 17 and x.isupper() and re.search(r"\s|O|I", x) is None
...
>>> (
...     df
...     # Mark the VIN groups
...     >> mutate(is_vin=cumsum(f.item.transform(isVIN)))
...     # Group by VINs
...     >> group_by(f.is_vin)
...     # Put the VINs and their values in new columns
...     >> mutate(vin=first(f.item), vin_value=first(f.value))
...     # Exclude VINs in the items
...     >> slice(~c(0))
...     # Get the values of A, B, C ...
...     >> pivot_wider([f.vin, f.vin_value], names_from=f.item, values_from=f.value)
...     # Select and rename columns
...     >> select(item=f.vin, value=f.vin_value, C=f.C)
... )
item value C
<object> <int64> <float64>
0 W0VEAZKXZMJ837930 0 32.0
1 W0VEAZKXZMJ857138 0 22.0
2 WAUZZZF23MN053792 0 NaN
3 WF0TK3SS2MMA50940 0 12.0
CodePudding user response:
The other answers are all very nice. For a bit more variety, you could also filter df for the "long" rows and the C rows, concat them side by side (this aligns on the original index, leaving NaNs where the two selections don't overlap), and then "compress" the DataFrame using groupby + first, which keeps the first non-null value per column within each VIN group:
out = pd.concat([df[df['item'].str.len() == 17],
                 df.loc[df['item'] == 'C', ['value']].set_axis(['C'], axis=1)],
                axis=1)
out = out.groupby(out['item'].str.len().eq(17).cumsum()).first().reset_index(drop=True)
Output:
item value C
0 WAUZZZF23MN053792 0.0 NaN
1 WF0TK3SS2MMA50940 0.0 12.0
2 W0VEAZKXZMJ857138 0.0 22.0
3 W0VEAZKXZMJ837930 0.0 32.0
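For intuition, here is the intermediate frame right after the concat (reconstructed from the sample data, before the groupby reassigns out): the row labels are the original index, and first() then collapses each VIN's pair of rows into one:
                 item  value     C
0   WAUZZZF23MN053792    0.0   NaN
2   WF0TK3SS2MMA50940    0.0   NaN
5                 NaN    NaN  12.0
8   W0VEAZKXZMJ857138    0.0   NaN
11                NaN    NaN  22.0
14  W0VEAZKXZMJ837930    0.0   NaN
17                NaN    NaN  32.0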