I'm trying to flatten two columns from a table loaded into a DataFrame, as below:
| u_group | t_group |
| --- | --- |
| {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"} | {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"} |
| {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"} | {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"} |
I want to separate them and get them as:
| u_group.link | u_group.value | t_group.link | t_group.value |
| --- | --- | --- | --- |
| https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 | https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 |
| https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 | https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 |
I used the code below, but wasn't successful.
import ast
from pandas.io.json import json_normalize

df12 = spark.sql("""select u_group, t_group from tbl""")

def only_dict(d):
    '''
    Convert a JSON string representation of a dictionary to a Python dict.
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Create a mapping of the tuples formed after
    converting JSON strings of a list to a Python list.
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df12['u_group'].apply(only_dict).tolist()).add_prefix('link.')
B = json_normalize(df12['u_group'].apply(list_of_dicts).tolist()).add_prefix('value.')
This fails with:
TypeError: 'Column' object is not callable
Kindly help, or suggest any other code that would work better.
CodePudding user response:
Here is a simple example with code for the answer.
example:
import pandas as pd

data = [[{'link': 'A1', 'value': 'B1'}, {'link': 'A2', 'value': 'B2'}],
        [{'link': 'C1', 'value': 'D1'}, {'link': 'C2', 'value': 'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])

output (df):
u t
0 {'link': 'A1', 'value': 'B1'} {'link': 'A2', 'value': 'B2'}
1 {'link': 'C1', 'value': 'D1'} {'link': 'C2', 'value': 'D2'}
Use the following code:
pd.concat([df[i].apply(lambda x: pd.Series(x)).add_prefix(i + '_') for i in df.columns], axis=1)
output:
u_link u_value t_link t_value
0 A1 B1 A2 B2
1 C1 D1 C2 D2
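In the original question the columns hold JSON strings rather than Python dicts. Assuming the Spark table is first pulled into pandas (e.g. with toPandas()), a minimal sketch of the same idea parses each string before splitting; the helper name flatten_json_columns and the pdf variable are only illustrative:

import ast
import pandas as pd

def flatten_json_columns(pdf, columns):
    # Assumption: pdf is a pandas DataFrame whose cells in `columns` are
    # JSON-like strings such as '{"link": "...", "value": "..."}'.
    parsed = pdf.copy()
    for c in columns:
        # ast.literal_eval turns the string into a Python dict, as in the question
        parsed[c] = parsed[c].apply(ast.literal_eval)
    # Expand each dict column into its own columns, prefixed with the source column name
    return pd.concat(
        [parsed[c].apply(pd.Series).add_prefix(c + '.') for c in columns],
        axis=1,
    )

# Example usage (assuming df12 is the Spark DataFrame from the question):
# flat = flatten_json_columns(df12.toPandas(), ['u_group', 't_group'])
# flat then has columns u_group.link, u_group.value, t_group.link, t_group.value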
CodePudding user response:
Here are my two cents: a simple way to achieve this using PySpark.
Create the DataFrame as follows:
data = [
    (
        """{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
        """{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
    ),
    (
        """{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
        """{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}""",
    ),
]
df = spark.createDataFrame(data, schema=['u_group', 't_group'])
Then use from_json() to parse the JSON string in each column and fetch the individual values as follows:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import from_json, col

schema_column = StructType([
    StructField("link", StringType(), True),
    StructField("value", StringType(), True),
])

df = df.withColumn('U_GROUP_PARSE', from_json(col('u_group'), schema_column))\
       .withColumn('T_GROUP_PARSE', from_json(col('t_group'), schema_column))\
       .withColumn('U_GROUP.LINK', col("U_GROUP_PARSE.link"))\
       .withColumn('U_GROUP.VALUE', col("U_GROUP_PARSE.value"))\
       .withColumn('T_GROUP.LINK', col("T_GROUP_PARSE.link"))\
       .withColumn('T_GROUP.VALUE', col("T_GROUP_PARSE.value"))\
       .drop('u_group', 't_group', 'U_GROUP_PARSE', 'T_GROUP_PARSE')
Print the DataFrame:
df.show(truncate=False)
The resulting DataFrame contains the four flattened columns U_GROUP.LINK, U_GROUP.VALUE, T_GROUP.LINK and T_GROUP.VALUE, one row per input row.
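A more compact variant of the same from_json() approach parses each column once and expands the struct fields in a single select. This is only a sketch, assuming df and schema_column are as defined above, before u_group and t_group were dropped; df_flat is an illustrative name:

from pyspark.sql.functions import from_json, col

# Sketch: parse each JSON column once, then pull the struct fields out in one select
df_flat = (
    df.withColumn("u", from_json(col("u_group"), schema_column))
      .withColumn("t", from_json(col("t_group"), schema_column))
      .select(
          col("u.link").alias("u_group.link"),
          col("u.value").alias("u_group.value"),
          col("t.link").alias("t_group.link"),
          col("t.value").alias("t_group.value"),
      )
)
df_flat.show(truncate=False)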