Home > OS >  stripping one column to create a new column through query
stripping one column to create a new column through query

Time:07-12

I have a large pandas data frame where one of the columns is wrapped in double quotes and some of the rows have multiple values, so they are comma separated. I want to extract these values from the names column, by stripping the quotes and then iterate through each name (we don't have to skip the blank ones since the query result will just be blank) and get a query result that will be placed in a new column, also comma separated. The other thing is that I don't want any duplicates in my query results listed. I don't need help with how to get the query results, more so the other parts in between like the stripping of quotes and then making sure the results are placed in the correct places with no duplicates.

(simplified) original pandas dataset

 ID1     ID2       names         
 01      01         "John"         
 01      02         "Kate, Ashten"       
 01      03        
 01      04         "Emily, Cathy, Joy"       

desired dataset

 ID1     ID2       names                  query_result_fav_color         
 01      01         "John"                    "pink"    
 01      02         "Kate, Ashten"            "blue"
 01      03
 01      04         "Emily, Cathy, Joy"       "red, green, blue"

what I've tried

values=original_df['names'] # series 
values=values.to_frame()
values = values.apply(lambda x: x.str.replace('"',''))


#do the query (get a 'results' list)

results= ["pink", "blue", "blue", "NA", "red", "green", "blue"]

#put the results from the query in a new column that corresponds to the correct row (need help with this)
 ID1     ID2       names                  query_result_fav_color         
 01      01         "John"                    "pink"    
 01      02         "Kate, Ashten"            "blue", "blue"
 01      03                                   "NA"
 01      04         "Emily, Cathy, Joy"       "red", "green", "blue"

# remove duplicates in same row, and make sure only 1 set of quotes around the results (need help with this)


CodePudding user response:

You can remove duplicate values from the list if the size of names and query_result_fav_color do not match.

Creating the data

df1 = pd.DataFrame({
    'ID1' : ['01','01','01','01'],
    'ID2' : ['01', '02', '03', '04'],
    'names' : ['"John"', '"Kate, Ashten"', "", '"Emily, Cathy, Joy"'],
    'query_result_fav_color' : ['"pink"', '"blue", "blue"', '"NA"', '"red", "green", "blue"']
    })

This gives us your data

  ID1 ID2                names  query_result_fav_color
0  01  01               "John"                  "pink"
1  01  02       "Kate, Ashten"          "blue", "blue"
2  01  03                                         "NA"
3  01  04  "Emily, Cathy, Joy"  "red", "green", "blue"

Pre-processing

Removing quotes from the data

df1 = df1.apply(lambda s:s.str.replace('"', ""))

Removing duplicates

from collections import OrderedDict
df1['query_result_fav_color'] = df1.apply(lambda s: ','.join(list(OrderedDict.fromkeys(s.query_result_fav_color.replace(' ','').split(','))))
          if (len(s.query_result_fav_color.replace(' ','').split(',')) == (s.names.count(',')   len(s.names[-1:]))) else "", axis=1)

Output :

This gives us the expected output :

df1
  ID1 ID2              names query_result_fav_color
0  01  01               John                   pink
1  01  02       Kate, Ashten                   blue
2  01  03                                          
3  01  04  Emily, Cathy, Joy         red,green,blue

CodePudding user response:

Assuming that you already replaced '"' with '' (not necessary for the code to run) in your frame so that you have the following data frame and list:

import pandas as pd

df = pd.DataFrame({"ID1": ["01"] * 4, "ID2": ["01", "02", "03", "04"], "names": ["John", "Kate, Ashton", "", "Emily, Cathy, Joy"]})
results= ["pink", "blue", "blue", "NA", "red", "green", "blue"]

You can convert the values from 'names' to lists and explode your data frame. Then you just have to assign your list to the exploded frame (which matches the length of the list now) and re-group it back by your 'ID' columns. For this, you do not really need another package besides of pandas. Just do something like this:

df = df.assign(names = df["names"].str.split(",")).explode(column="names").assign(query_result_fav_color=results).groupby(["ID1", "ID2"])["names", "query_result_fav_color"].agg(list)

df.assign(
    names=df["names"].apply(lambda x: ", ".join(x) if x else ""),
    query_result_fav_color=df["query_result_fav_color"].apply(lambda x: ", ".join(x))
).reset_index()

-------------------------------------------------
    ID1 ID2 names               query_result_fav_color
0   01  01  John                pink
1   01  02  Kate, Ashton        blue, blue
2   01  03                      NA
3   01  04  Emily, Cathy, Joy   red, green, blue
-------------------------------------------------

If you want to have an one-liner that does this, here you go:

(
    df.assign(query_result_fav_color=df.assign(names = df["names"].str.split(","))
           .explode(column="names").drop(columns=["names"])
           .assign(query_result_fav_color=results)
           .groupby(["ID1", "ID2"])["query_result_fav_color"]
           .agg(list).apply(lambda x: ", ".join(x) if x else "").values)
)
  • Related