How do I search through a column of lists in python pandas and return the item in the list as well a


I have made a pandas data frame with two main columns: one with a job name and the other with a SQL script.

I need to extract the tables ending in '_REP'. I have split the script into lists of words (note the csv doesn't have commas in the script originally), and I need to return the EXTRACTION value along with the actual tables that end in '_REP'; if there is no _REP table it should return noRepTable or something similar to indicate that. The result needs to be a csv as well.


import pandas as pd

df = pd.read_csv("requests.csv")
df["sql_split"] = df["sql"].str.split(" ")

EXTRACTION  ...                                          sql_split
0  AU01     ...  [SELECT, COLUMN1, AS, CONNECTION_ID, FROM, TABLE_REP]
1  AU04     ...  [SELECT, COLUMN2, AS, EVENT_ACTION, FROM, TABLE2_REP]
2  AU05     ...  [SELECT, COLUMN1, AS, ID, FROM, TABLE_DB]


expected result:

AU01,TABLE_REP
AU04,TABLE2_REP
AU05,noRepTable


CodePudding user response:

Instead of splitting on the " " character and using that to get at the "_REP", you could try using a regex search function to find the specific pattern.

Your regex expression would attempt to find any pattern that starts with a space ([ ] in the pattern), has some non-space characters in the middle ([A-Z0-9]+ in the pattern), and ends with "_REP" ((_REP) in the pattern). This pattern should also be located at the end of your SQL query, hence the $ at the end. You can remove the $ and add in another [ ] marker instead if the table name can be in the middle of your SQL query (sketched further below).

It would look like this:

import re

pattern = r"[ ][A-Z0-9] (_REP)$" # this is the pattern you're looking for

# Get an object that matches what you're looking for or gives you a None
df["rep_checks"]= df["sql"].apply(lambda x: re.search(pattern, x)) 

# Extract the useful table name and remove unnecessary spaces, or insert "noRepTable"
df["rep_checks"] = df["rep_checks"].apply(lambda x: x.group().strip() is x is not None else 'noRepTable') 

Check out the specific documentation on re.search too!
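As a minimal sketch of that variant (the sample query string below is made up for illustration): if the _REP table can appear in the middle of the query, replace the $ anchor with another [ ] marker:

import re

# Hedged variant of the pattern above: the table may sit anywhere in the
# query, so the trailing "$" is replaced by another "[ ]" marker
pattern_anywhere = r"[ ][A-Z0-9]+(_REP)[ ]"

query = "SELECT COLUMN1 AS CONNECTION_ID FROM TABLE_REP WHERE COLUMN1 IS NOT NULL"
match = re.search(pattern_anywhere, query)
print(match.group().strip() if match else "noRepTable")  # prints: TABLE_REP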

CodePudding user response:

I noticed that you want to extract the table name, which in SQL occurs after "FROM". So my idea, for each row, is to:

  • find "FROM" element in sql_split,
  • get the table name (the next element),
  • return either the table name (if it ends with "_REP") or "noRepTable", along with EXTRACTION attribute.

To do it, define a function to be applied to each row:

def myConv(row):
    tbl = row.sql_split        # list of words from the sql script
    ind = tbl.index('FROM')    # position of the FROM keyword
    tblName = tbl[ind + 1]     # the table name is the next element
    return row.EXTRACTION, tblName if tblName.endswith('_REP') else 'noRepTable'

Then apply it:

result = df.apply(myConv, axis=1, result_type='expand')

The result is:

      0           1
0  AU01   TABLE_REP
1  AU04  TABLE2_REP
2  AU05  noRepTable

So far the column names are consecutive numbers, but you can rename them any way you wish.
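For example, a minimal sketch (the column names and the output file name are just my choice; header=False matches the expected output in the question):

# Rename the numeric columns, then write the result out as csv
result.columns = ['EXTRACTION', 'table']
result.to_csv('rep_tables.csv', index=False, header=False)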

Notice: I assumed that each source row contains a single table name ending with "_REP". If your case is more complicated, you should state precisely all such details.

CodePudding user response:

Use Series.str.extract with df.fillna:

In [1708]: df['final'] = df['sql'].str.extract(r'(\w*_REP)').fillna('noRepTable')

In [1711]: df
Out[1711]: 
  EXTRACTION       final
0       AU01   TABLE_REP
1       AU04  TABLE2_REP
2       AU05  noRepTable
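If you also need the result as a csv (as the question mentions), a minimal follow-up sketch, assuming an output file named rep_tables.csv:

# Keep only the two relevant columns; no index or header, to match
# the "AU01,TABLE_REP" style expected output
df[['EXTRACTION', 'final']].to_csv('rep_tables.csv', index=False, header=False)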