Duplicating a single row in a dataframe

Time:09-30

I want to write a function:

def dupRow(df, val):

which takes a DataFrame and a value. It finds the row whose 'val' column equals that value and duplicates just that row. So, for example, if df is:

  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
3   d      0      4      6

then dupRow(df, 'c') returns:

  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
3   c      7      3      7
4   d      0      4      6

It's fine to put the duplicated row at the bottom; I can reorder the rows when I'm done. It's just easier to see this way.

I've seen a bunch of things using np.repeat, but I can't figure out how to get it to do that only once rather than on the entire index...

CodePudding user response:

IIUC, try:

def dupRow(df, val):
    return df.append(df[df["val"].eq(val)]).sort_values("val").reset_index(drop=True)

>>> dupRow(df, 'c')
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
3   c      7      3      7
4   d      0      4      6
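
Note that DataFrame.append was removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat. A minimal sketch, assuming the column is named val as in the question; the name dup_row_concat is hypothetical. It sorts on the original index rather than on val, so the copy lands next to its source without reordering the other rows:

```python
import pandas as pd

def dup_row_concat(df, val):
    # Rows whose 'val' column equals the requested value.
    match = df[df["val"].eq(val)]
    # Stack the original frame and the matching row(s), then sort on the
    # original index so each copy sits right after the row it came from.
    out = pd.concat([df, match]).sort_index()
    # Renumber the rows 0..n-1, as in the question's expected output.
    return out.reset_index(drop=True)

df = pd.DataFrame({
    "val": ["a", "b", "c", "d"],
    "data1": [3, 89, 7, 0],
    "data2": [1, 2, 3, 4],
    "data3": [9, 8, 7, 6],
})
print(dup_row_concat(df, "c"))
```

The original df is left unchanged; the function returns a new frame.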

CodePudding user response:

Usually we use reindex:

row = 'c'
out = df.reindex(df.index.append(df.index[df.val==row])).sort_index()
out
Out[27]: 
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
2   c      7      3      7
3   d      0      4      6
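
Wrapped up as the dupRow function from the question, the reindex approach might look like the sketch below. It assumes the frame's index is unique, since reindex raises an error on an index that already contains duplicate labels. The final reset_index(drop=True) is an addition here, to get the 0..4 numbering shown in the question:

```python
import pandas as pd

def dupRow(df, val):
    # Index labels of the row(s) to duplicate.
    extra = df.index[df["val"] == val]
    # Reindexing with the label listed twice pulls that row twice;
    # sort_index then moves the copy next to the original.
    out = df.reindex(df.index.append(extra)).sort_index()
    # Renumber 0..n-1 (drop this line to keep the repeated label).
    return out.reset_index(drop=True)

df = pd.DataFrame({
    "val": ["a", "b", "c", "d"],
    "data1": [3, 89, 7, 0],
    "data2": [1, 2, 3, 4],
    "data3": [9, 8, 7, 6],
})
print(dupRow(df, "c"))
```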

CodePudding user response:

You can do:

def dupRow(df, val):
    return df[df.val!=val].append([df[df.val==val]]*2)

Example:

data = {'val': [1,2,3,4],
        'col2': ['a','b','c','d']}
df = pd.DataFrame(data)

df is:

   val col2
0    1    a
1    2    b
2    3    c
3    4    d

dupRow(df,3) returns:

   val col2
0    1    a
1    2    b
3    4    d
2    3    c
2    3    c

CodePudding user response:

You can use Index.repeat:

def dupRow(df, val):
    return df.reindex(df.index.repeat(df['val'].eq(val).astype(int).add(1)))

>>> dupRow(df, 'c')
  val  data1  data2  data3
0   a      3      1      9
1   b     89      2      8
2   c      7      3      7
2   c      7      3      7
3   d      0      4      6
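
Since the question mentions np.repeat: the trick is to pass one repeat count per row instead of a single number. A rough sketch of that idea, generalised to any number of copies; the function name dup_row_n and the parameter n are hypothetical, and a unique index is assumed:

```python
import numpy as np
import pandas as pd

def dup_row_n(df, val, n=2):
    # One repeat count per row: n for matching rows, 1 for all the others.
    counts = np.where(df["val"].eq(val), n, 1)
    # Index.repeat accepts an array of counts, one per label, so only the
    # matching label is repeated.
    return df.loc[df.index.repeat(counts)]

df = pd.DataFrame({
    "val": ["a", "b", "c", "d"],
    "data1": [3, 89, 7, 0],
    "data2": [1, 2, 3, 4],
    "data3": [9, 8, 7, 6],
})
print(dup_row_n(df, "c", n=3))
```

With the default n=2 this gives the same rows as dupRow above.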

Just for information (the difference is probably not significant):

def dupRow_zabop(df, val):
    return df[df.val!=val].append([df[df.val==val]]*2)

def dupRow_not_speshal(df, val):
    return df.append(df[df["val"].eq(val)]).sort_values("val").reset_index(drop=True)

def dupRow_beny(df, val):
    return df.reindex(df.index.append(df.index[df.val==val])).sort_index()

def dupRow_corralien(df, val):
    return df.reindex(df.index.repeat(df['val'].eq(val).astype(int).add(1)))

%timeit dupRow_zabop(df, 'c')
744 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit dupRow_not_speshal(df, 'c')
714 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit dupRow_beny(df, 'c')
393 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit dupRow_corralien(df, 'c')
345 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In these timings the two append-based versions take roughly twice as long as the reindex-based ones: append is a more expensive operation than reindex.
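
To repeat the comparison outside IPython, or on a pandas version where DataFrame.append no longer exists, something like the sketch below could be used. It relies on the standard timeit module, and pd.concat stands in for append, so it is not the same measurement as above. The function names dup_concat and dup_reindex are hypothetical, and no numbers are shown because they depend on the machine and the pandas version:

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    "val": ["a", "b", "c", "d"],
    "data1": [3, 89, 7, 0],
    "data2": [1, 2, 3, 4],
    "data3": [9, 8, 7, 6],
})

def dup_concat(df, val):
    # pd.concat stands in for the removed DataFrame.append.
    return pd.concat([df, df[df["val"].eq(val)]]).sort_index()

def dup_reindex(df, val):
    # Repeat count is 2 for the matching row and 1 for every other row.
    return df.reindex(df.index.repeat(df["val"].eq(val).astype(int).add(1)))

for fn in (dup_concat, dup_reindex):
    # Total seconds for 1000 calls, converted to microseconds per call.
    total = timeit.timeit(lambda: fn(df, "c"), number=1000)
    print(f"{fn.__name__}: {total / 1000 * 1e6:.0f} µs per call")
```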
