Home > Software engineering >  need to generate a new data frame with more no. of similar record from an existing data frame
need to generate a new data frame with more no. of similar record from an existing data frame

Time:11-27

I have the below dataframe data sample,

val df= spark.read.option("inferSchema",true).orc("abc/path/abc.snappy.orc")

df.show()
ID, date, timestamp, count, idcount, unit, code, Pcode, ccode, bid, vcode

12345432,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

I want to write a pyspark code to generate more no. of record by just incrementing the ID and remaining column as it is.

Example

12345432,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

12345433,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

12345434,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

12345435,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

12345436,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

12345437,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds

I tried using lit, but not able to arrive at exact code

CodePudding user response:

After the update my solution seems to be quite cumnbersome but this is all i can offer so far.

Assuming you have your ID column as dataframe index, you can simply do:

import numpy as np
incr = 10
df = df.reindex(np.append(df.index.values,
                          range(df.index.max() 1, df.index.max() incr)),
                method="ffill")

and get:

                date     timestamp  count  idcount    unit  code  Pcode  ccode        bid  vcode
ID                                                                                              
12345432  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345433  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345434  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345435  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345436  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345437  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345438  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345439  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345440  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345441  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds
12345442  10-11-2011  11:11:12.555      0        0  XVC_AS    12     14     19  123454323  qweds

this is the proper way of reindexinf in case you already have a daraframe with more the one rows.

UPDATE due to question additions:

if you need to increment columns that are not index, i have nothing to offer more original than this:

incr = 10

df = df.reindex(np.append(df.index.values,
                          range(df.index.max() 1, df.index.max() incr)))#,
cols_to_incr = ["Pcode", "code"]
df = df.apply(lambda x: x.ffill() if x.name not in cols_to_incr else x)

for col in cols_to_incr:
    df.loc[df[col].idxmax() 1:,col] = np.arange(df[col].max() 1, df[col].max() 1 len(df.loc[df[col].idxmax() 1:,col]))
                date     timestamp  count  idcount    unit  code  Pcode  ccode          bid  vcode
ID                                                                                                
12345432  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  12.0   14.0   19.0  123454323.0  qweds
12345433  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  13.0   15.0   19.0  123454323.0  qweds
12345434  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  14.0   16.0   19.0  123454323.0  qweds
12345435  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  15.0   17.0   19.0  123454323.0  qweds
12345436  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  16.0   18.0   19.0  123454323.0  qweds
12345437  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  17.0   19.0   19.0  123454323.0  qweds
12345438  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  18.0   20.0   19.0  123454323.0  qweds
12345439  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  19.0   21.0   19.0  123454323.0  qweds
12345440  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  20.0   22.0   19.0  123454323.0  qweds
12345441  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  21.0   23.0   19.0  123454323.0  qweds
12345442  10-11-2011  11:11:12.555    0.0      0.0  XVC_AS  22.0   24.0   19.0  123454323.0  qweds
  • Related