I have the below dataframe data sample,
val df= spark.read.option("inferSchema",true).orc("abc/path/abc.snappy.orc")
df.show()
ID, date, timestamp, count, idcount, unit, code, Pcode, ccode, bid, vcode
12345432,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
I want to write a pyspark code to generate more no. of record by just incrementing the ID and remaining column as it is.
Example
12345432,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
12345433,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
12345434,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
12345435,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
12345436,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
12345437,10-11-2011,11:11:12.555,0,0,XVC_AS,12,14,19,123454323,qweds
I tried using lit, but not able to arrive at exact code
CodePudding user response:
After the update my solution seems to be quite cumnbersome but this is all i can offer so far.
Assuming you have your ID column as dataframe index, you can simply do:
import numpy as np
incr = 10
df = df.reindex(np.append(df.index.values,
range(df.index.max() 1, df.index.max() incr)),
method="ffill")
and get:
date timestamp count idcount unit code Pcode ccode bid vcode
ID
12345432 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345433 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345434 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345435 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345436 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345437 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345438 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345439 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345440 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345441 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
12345442 10-11-2011 11:11:12.555 0 0 XVC_AS 12 14 19 123454323 qweds
this is the proper way of reindexinf in case you already have a daraframe with more the one rows.
UPDATE due to question additions:
if you need to increment columns that are not index, i have nothing to offer more original than this:
incr = 10
df = df.reindex(np.append(df.index.values,
range(df.index.max() 1, df.index.max() incr)))#,
cols_to_incr = ["Pcode", "code"]
df = df.apply(lambda x: x.ffill() if x.name not in cols_to_incr else x)
for col in cols_to_incr:
df.loc[df[col].idxmax() 1:,col] = np.arange(df[col].max() 1, df[col].max() 1 len(df.loc[df[col].idxmax() 1:,col]))
date timestamp count idcount unit code Pcode ccode bid vcode
ID
12345432 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 12.0 14.0 19.0 123454323.0 qweds
12345433 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 13.0 15.0 19.0 123454323.0 qweds
12345434 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 14.0 16.0 19.0 123454323.0 qweds
12345435 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 15.0 17.0 19.0 123454323.0 qweds
12345436 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 16.0 18.0 19.0 123454323.0 qweds
12345437 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 17.0 19.0 19.0 123454323.0 qweds
12345438 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 18.0 20.0 19.0 123454323.0 qweds
12345439 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 19.0 21.0 19.0 123454323.0 qweds
12345440 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 20.0 22.0 19.0 123454323.0 qweds
12345441 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 21.0 23.0 19.0 123454323.0 qweds
12345442 10-11-2011 11:11:12.555 0.0 0.0 XVC_AS 22.0 24.0 19.0 123454323.0 qweds