I want to write a function:
def dupRow(df, val):
which takes a DataFrame and a value, finds the row whose 'val' column equals that value, and duplicates just that row. For example, if df is:
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
3   d      0      4      6
then dupRow(df, 'c')
returns:
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
3   c      7      3      7
4   d      0      4      6
It can put the duplicated row at the bottom; I can reorder the rows when I'm done. It's just easier to see this way.
I've seen a bunch of things using np.repeat, but I can't figure out how to get it to do that only once rather than on the entire index...
CodePudding user response:
IIUC, try:
def dupRow(df, val):
    # Note: DataFrame.append was removed in pandas 2.0, so this needs pandas < 2.0.
    return df.append(df[df["val"].eq(val)]).sort_values("val").reset_index(drop=True)
>>> dupRow(df, 'c')
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
3   c      7      3      7
4   d      0      4      6
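For readers on pandas 2.0 or later, where DataFrame.append is no longer available, the same idea can probably be sketched with pd.concat. The name dup_row_concat is made up for this illustration, and the stable sort is an added assumption so that the copy stays next to its original:

```python
import pandas as pd

def dup_row_concat(df, val):
    # Rows whose 'val' column equals the requested value.
    matches = df[df["val"].eq(val)]
    # pd.concat stands in for the removed DataFrame.append.
    out = pd.concat([df, matches])
    # Sorting on 'val' puts the copy back beside the original row;
    # this assumes the 'val' column is sortable.
    return out.sort_values("val", kind="stable").reset_index(drop=True)

df = pd.DataFrame({
    "val": ["a", "b", "c", "d"],
    "data1": [3, 89, 7, 0],
    "data2": [1, 2, 3, 4],
    "data3": [9, 8, 7, 6],
})
result = dup_row_concat(df, "c")
print(result)
```

Note that sorting on 'val' reorders the whole frame, so this only preserves the original row order when the column is already sorted, as it is in the question.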
CodePudding user response:
Usually we use reindex:
row = 'c'
out = df.reindex(df.index.append(df.index[df.val==row])).sort_index()
Out[27]:
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
2   c      7      3      7
3   d      0      4      6
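One caveat: reindex needs unique index labels, so the snippet above raises an error if df.index already contains duplicates. A position-based variant, as a rough sketch (dup_row_positional is a hypothetical name, not from the answer), avoids that by working with iloc:

```python
import numpy as np
import pandas as pd

def dup_row_positional(df, val):
    # Integer positions of every row, plus the positions of matching rows once more.
    positions = np.arange(len(df))
    extra = np.flatnonzero(df["val"].eq(val).to_numpy())
    # Sorting keeps each copy directly after its original row.
    order = np.sort(np.concatenate([positions, extra]))
    return df.iloc[order]

# The index deliberately contains a duplicate label ("x") to show that
# position-based selection does not care about label uniqueness.
df = pd.DataFrame(
    {"val": ["a", "b", "c", "d"], "data1": [3, 89, 7, 0]},
    index=["x", "x", "y", "z"],
)
result = dup_row_positional(df, "c")
print(result)
```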
CodePudding user response:
You can do:
def dupRow(df, val):
    # Note: DataFrame.append was removed in pandas 2.0, so this needs pandas < 2.0.
    return df[df.val!=val].append([df[df.val==val]]*2)
Example:
data = {'val': [1,2,3,4],
        'col2': ['a','b','c','d']}
df = pd.DataFrame(data)
df
is:
   val col2
0    1    a
1    2    b
2    3    c
3    4    d
dupRow(df,3)
returns:
   val col2
0    1    a
1    2    b
3    4    d
2    3    c
2    3    c
CodePudding user response:
You can use Index.repeat:
def dupRow(df, val):
    # Repeat count is 2 for rows matching val and 1 for every other row.
    return df.reindex(df.index.repeat(df['val'].eq(val).astype(int).add(1)))
>>> dupRow(df, 'c')
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
2   c      7      3      7
3   d      0      4      6
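If more than one extra copy is ever needed, the repeat counts can be built explicitly. This is only a sketch: dup_row_n and its copies parameter are hypothetical additions, not part of the question:

```python
import numpy as np
import pandas as pd

def dup_row_n(df, val, copies=2):
    # Boolean mask of the rows to duplicate.
    mask = df["val"].eq(val).to_numpy()
    # Matching rows appear `copies` times in total, all other rows once.
    counts = np.where(mask, copies, 1)
    return df.reindex(df.index.repeat(counts))

df = pd.DataFrame({
    "val": ["a", "b", "c", "d"],
    "data1": [3, 89, 7, 0],
    "data2": [1, 2, 3, 4],
    "data3": [9, 8, 7, 6],
})
# For val == 'c' and copies=2 the counts are [1, 1, 2, 1].
result = dup_row_n(df, "c", copies=3)
print(result)
```

With copies=2 this should behave like the one-liner above; like any reindex-based approach, it assumes the index labels are unique.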
Just for information (the difference is probably not significant):
def dupRow_zabop(df, val):
    return df[df.val!=val].append([df[df.val==val]]*2)
def dupRow_not_speshal(df, val):
    return df.append(df[df["val"].eq(val)]).sort_values("val").reset_index(drop=True)
def dupRow_beny(df, val):
    return df.reindex(df.index.append(df.index[df.val==val])).sort_index()
def dupRow_corralien(df, val):
    return df.reindex(df.index.repeat(df['val'].eq(val).astype(int).add(1)))
%timeit dupRow_zabop(df, 'c')
744 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dupRow_not_speshal(df, 'c')
714 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dupRow_beny(df, 'c')
393 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dupRow_corralien(df, 'c')
345 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
append is a more expensive operation than reindex.
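To check this on your own machine and pandas version, something like the following could be used. It is only a sketch: pd.concat stands in for append (which no longer exists in pandas 2.0+), the frame size and call count are arbitrary choices, and the function names are made up for the comparison:

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary test frame; the size is an assumption chosen for illustration.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "val": np.arange(100_000),
    "data1": rng.integers(0, 100, 100_000),
})

def dup_concat(df, val):
    # Concatenate the matching row, then sort the index to restore the order.
    return pd.concat([df, df[df["val"].eq(val)]]).sort_index(kind="stable")

def dup_repeat(df, val):
    # Repeat the matching index label once more and reindex.
    return df.reindex(df.index.repeat(df["val"].eq(val).astype(int).add(1)))

calls = 20
for fn in (dup_concat, dup_repeat):
    seconds = timeit.timeit(lambda: fn(big, 500), number=calls)
    print(f"{fn.__name__}: {seconds / calls * 1e3:.3f} ms per call")
```

The numbers will vary with the frame size, the dtypes, and the pandas version, so treat any single run as indicative only.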