I have a situation where I need to extract some information from a column on the basis of another column. The table size is quite big. It is having almost 50 columns and 70M records. Attaching screenshot below to explain the situation.
id idkey ValuesNeededInAnotherColumn
----- --------------- ---------------------------
123 10012300152 152
12340 100001234001400 1400
12 20123152 3152
253 5000253
so table is having a column idkey. idkey is made up of some values like companycode(100) id(123) custcode(00152) = 10012300152. The length of company code and id is not fixed. They can vary in length but position of data is fixed so the value after id in idkey is always fixed. If there is nothing in idkey after id value that means custcode is null. The solution which i am trying to implement is to find the position of id column in idkey column and then substring the value till end and cast it as int. This solution is taking too much time as i have to convert dataframe into rdds as it is not possible to do it directly on dataframe.
Anyone having an optimized solution that can be implemented on quite big table then please help.
CodePudding user response:
I would have concerns about ambiguity. For example:
id idkey ValuesNeededInAnotherColumn
----- --------------- ---------------------------
12 120123012 123012 or 3012 or null?
Not considering that, it's relatively simple use of regexp_extract function.
SELECT id, idkey,
cast(regexp_extract(idkey, concat(id, '(.*)'), 1) as int) as ValuesNeededInAnotherColumn
FROM df;