How to find end position of a column's value in another column in Pyspark?-CodePudding

I have a situation where I need to extract some information from a column on the basis of another column. The table size is quite big. It is having almost 50 columns and 70M records. Attaching screenshot below to explain the situation.

id     idkey            ValuesNeededInAnotherColumn
-----  ---------------  ---------------------------
  123  10012300152                              152
12340  100001234001400                         1400
   12  20123152                                3152
  253  5000253

so table is having a column idkey. idkey is made up of some values like companycode(100) id(123) custcode(00152) = 10012300152. The length of company code and id is not fixed. They can vary in length but position of data is fixed so the value after id in idkey is always fixed. If there is nothing in idkey after id value that means custcode is null. The solution which i am trying to implement is to find the position of id column in idkey column and then substring the value till end and cast it as int. This solution is taking too much time as i have to convert dataframe into rdds as it is not possible to do it directly on dataframe.

Anyone having an optimized solution that can be implemented on quite big table then please help.

CodePudding user response：

I would have concerns about ambiguity. For example:

id     idkey            ValuesNeededInAnotherColumn
-----  ---------------  ---------------------------
   12  120123012            123012 or 3012 or null?

Not considering that, it's relatively simple use of regexp_extract function.

SELECT id, idkey,
       cast(regexp_extract(idkey, concat(id, '(.*)'), 1) as int) as ValuesNeededInAnotherColumn
  FROM df;