Pandas dataframe, checking if row has exceptions and creating multiple string variables depending if-CodePudding

I'm stuck on figuring this out. I have the following df I created from output. I'm looping through to get the table_name, schema, column_name and comment. I want to make sure there is a comment present, and if there is create a string query that appends to the list query_list.

output =  [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind', 'expression', 'comment', 'database_name', 'autoincrement'], ['ACCOUNT', 'SFO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', 'DATE of Account', 'VE'], ['ACCOUNT', 'SFO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', 'file name', 'VE'], ['ACCOUNT', 'SFO', '_LOAD_FILE_TIMESTAMP', '{"type":"TIMESTAMP_NTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', '', 'VE'], ['CUSTOMER', 'SFO', 'SUBSCRIPTIONSLIST', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'VE'], ['CUSTOMER', 'SFO', 'CONTACTROLESLIST', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', 'list of contract', 'VE'], ['DATA', 'SFO', 'OPPORTUNITY_NAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'VE']]

output = filter(bool, output)
df = pd.DataFrame(output)


df.columns = df.iloc[0]
df = df[1:]

query_list = []
grouped_comments = ''

for index, row in df.iterrows():
    if row['comment'] is not None and row['comment'] != '':
        
        if len(row['table_name']) > 1:
            # the below doesn't work, it groups all table comments together

            sql = f"(COLUMN {row['column_name']} COMMENT '{row['comment']}')"
            grouped_comments = grouped_comments   sql

        elif len(row['table_name']) == 1:
            sql = f"ALTER TABLE {row['schema_name']}.{row['table_name']} ALTER COLUMN {row['column_name']} COMMENT '{row['comment']}';"

        query_list.append(sql)

Now the part I'm stuck on getting, is if there is a comment present and if the table_name shows up more then once then it should create a string that looks like the below, so it should be fetching all the column_name and comment and grouping into one string for that table_name:

"ALTER TABLE VE.ACCOUNT ALTER (COLUMN _LOAD_DATETIME COMMENT 'DATE of Account', COLUMN _LOAD_FILENAME COMMENT 'file name');"

and the elif works because if table_name only shows up once then it populates the correct string:

"ALTER TABLE VE.CUSTOMER ALTER COLUMN CONTACTROLESLIST COMMENT 'list of contract';"

so finally if I have the 2 above strings then my query_list should look like:

query_list = ["ALTER TABLE VE.ACCOUNT ALTER (COLUMN _LOAD_DATETIME COMMENT 'DATE of Account', COLUMN _LOAD_FILENAME COMMENT 'file name');",
 "ALTER TABLE VE.CUSTOMER ALTER COLUMN CONTACTROLESLIST COMMENT 'list of contract';"]

CodePudding user response：

First, you can filter out unnecessary rows.

df = df[df.comment.notnull() & (df.comment.str.len() > 0)]

Then, concatenate the column statement.

df['column_prop'] = 'COLUMN '   df.column_name   ' COMMENT \''   df.comment   '\''

Now, aggregate the column_prop to join the string, group by the table name.

df = df.groupby(['database_name', 'table_name']).agg({'column_prop': lambda x: ', '.join(x)}).reset_index()

This will give you the following.

0 database_name table_name                                                                                 column_prop
0            VE    ACCOUNT  COLUMN _LOAD_DATETIME COMMENT 'DATE of Account', COLUMN _LOAD_FILENAME COMMENT 'file name'
1            VE   CUSTOMER                                          COLUMN CONTACTROLESLIST COMMENT 'list of contract'

From here, you can concatenate the each columns with additional string to obtain your desired string.

If your dataframe is small, you can simply concatenate like this.

df['sql'] = 'ALTER TABLE '   df.database_name   '.'   ...

Or check here https://stackoverflow.com/a/54298586/2956135 for better string concatenation approach.