How to create a new DataFrame by using a substring and matching column values in Python pandas

Time:05-06

Suppose I have a simple DataFrame with four columns: Food, Kitchen, City, and Detail.

import pandas as pd

d = {'Food': ['P1|0', 'P2', 'P3|45', 'P1', 'P2', 'P4', 'P1|1', 'P3|7', 'P5', 'P1||23'],
     'Kitchen': ['L1', 'L2', 'L9', 'L4', 'L5', 'L6', 'L1', 'L9', 'L10', 'L1'],
     'City': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
     'Detail': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd0']}
df = pd.DataFrame(data=d)

My goal is to take the part of each Food value before the first | and build a new DataFrame that shows which kitchens produce similar foods. I treat two rows as similar when their substrings match and they share the same Kitchen.

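# Make sure every Food value is a string before splitting
df['Food'] = df['Food'].astype(str)

# subFood is the part of Food before the first '|'
df.insert(0, 'subFood', df['Food'].str.split('|').str[0])
df.iloc[:, :2]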
  subFood    Food
0      P1    P1|0
1      P2      P2
2      P3   P3|45
3      P1      P1
4      P2      P2
5      P4      P4
6      P1    P1|1
7      P3    P3|7
8      P5      P5
9      P1  P1||23
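
As a minimal, self-contained sketch of the substring step: the short `food` Series and the names `sub_food` and `sub_food_alt` are only for illustration and are not part of the question. It assumes the prefix of interest is everything before the first `|`.

```python
import pandas as pd

# Illustrative sample covering the three shapes seen in the Food column
food = pd.Series(['P1|0', 'P2', 'P3|45', 'P1||23'], name='Food')

# Split on the literal '|' and keep the first piece; values without
# a '|' are returned unchanged.
sub_food = food.str.split('|').str[0]

# Same result, but n=1 stops splitting after the first separator.
sub_food_alt = food.str.split('|', n=1).str[0]

print(sub_food.tolist())  # ['P1', 'P2', 'P3', 'P1']
```

A single-character pattern such as `'|'` is treated as a literal string by `Series.str.split`, so no regex escaping is needed here.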

To do so, I merge the DataFrame with itself on subFood and Kitchen, and then use query to keep only the pairs that come from different cities.

df.merge(df, on=['subFood', 'Kitchen'], suffixes=('_1', '_2')).query('City_1 != City_2')

   subFood  Food_1 Kitchen City_1 Detail_1  Food_2 City_2 Detail_2
1       P1    P1|0      L1      A       d1    P1|1      C       d7
2       P1    P1|0      L1      A       d1  P1||23      D       d0
3       P1    P1|1      L1      C       d7    P1|0      A       d1
5       P1    P1|1      L1      C       d7  P1||23      D       d0
6       P1  P1||23      L1      D       d0    P1|0      A       d1
7       P1  P1||23      L1      D       d0    P1|1      C       d7
11      P3   P3|45      L9      A       d3    P3|7      C       d8
12      P3    P3|7      L9      C       d8   P3|45      A       d3
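
To see why each pair shows up twice in the result above, here is a small sketch on a three-row frame. The data and the names `pairs` and `cross_city` are made up for illustration. A self-merge pairs every row with every row that shares its key, including itself, and the City filter then removes only the same-city pairs.

```python
import pandas as pd

# Tiny illustrative frame: two rows share (P1, L1), one row is alone
df = pd.DataFrame({
    'subFood': ['P1', 'P1', 'P2'],
    'Food': ['P1|0', 'P1|1', 'P2'],
    'Kitchen': ['L1', 'L1', 'L2'],
    'City': ['A', 'C', 'A'],
})

# Self-merge: 2 rows for (P1, L1) give 2 x 2 = 4 pairs, (P2, L2) gives 1
pairs = df.merge(df, on=['subFood', 'Kitchen'], suffixes=('_1', '_2'))
print(len(pairs))  # 5

# Keep only pairs from different cities; (a, b) and (b, a) both survive
cross_city = pairs.query('City_1 != City_2')
print(len(cross_city))  # 2
```

The index labels of the merged rows can differ between pandas versions, so it is safer not to rely on them when inspecting the result.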

I got stuck here. I would like to end up with a DataFrame similar to the one shown below. I would appreciate any help or hints.

subFood Food_1  Food_2 Kitchen City Detail
P1       P1|0    P1|0    L1       A   d1
P1       P1|0    P1|1    L1       C   d1  
....

CodePudding user response:

IIUC, you can split each row into two rows by combining the two city names into a list and then using explode:

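# Pair rows that share subFood and Kitchen, keeping only pairs from different cities
merged = df.merge(df, on=["subFood", "Kitchen"], suffixes=("_1", "_2")).query("City_1 != City_2")
# Store both cities of each pair as a list in a single column
merged["City"] = merged[["City_1", "City_2"]].to_numpy().tolist()
# Drop the helper columns and expand each list into one row per city
output = merged.drop(["City_1", "City_2", "Detail_2"], axis=1).explode("City").rename(columns={"Detail_1": "Detail"})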

>>> output
   subFood  Food_1 Kitchen Detail  Food_2 City
1       P1    P1|0      L1     d1    P1|1    A
1       P1    P1|0      L1     d1    P1|1    C
2       P1    P1|0      L1     d1  P1||23    A
2       P1    P1|0      L1     d1  P1||23    D
3       P1    P1|1      L1     d7    P1|0    C
3       P1    P1|1      L1     d7    P1|0    A
5       P1    P1|1      L1     d7  P1||23    C
5       P1    P1|1      L1     d7  P1||23    D
6       P1  P1||23      L1     d0    P1|0    D
6       P1  P1||23      L1     d0    P1|0    A
7       P1  P1||23      L1     d0    P1|1    D
7       P1  P1||23      L1     d0    P1|1    C
11      P3   P3|45      L9     d3    P3|7    A
11      P3   P3|45      L9     d3    P3|7    C
12      P3    P3|7      L9     d8   P3|45    C
12      P3    P3|7      L9     d8   P3|45    A
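
For convenience, the whole approach can be run end to end with the sketch below, using the data from the question. The `.copy()` and `reset_index(drop=True)` calls are additions that the answer does not use: the first makes the filtered frame independent before a column is added, and the second replaces the repeated index labels that `explode` leaves behind.

```python
import pandas as pd

d = {'Food': ['P1|0', 'P2', 'P3|45', 'P1', 'P2', 'P4', 'P1|1', 'P3|7', 'P5', 'P1||23'],
     'Kitchen': ['L1', 'L2', 'L9', 'L4', 'L5', 'L6', 'L1', 'L9', 'L10', 'L1'],
     'City': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
     'Detail': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd0']}
df = pd.DataFrame(data=d)

# Part of Food before the first '|'
df.insert(0, 'subFood', df['Food'].str.split('|').str[0])

# Self-merge on the matching keys, keep cross-city pairs only
merged = (df.merge(df, on=['subFood', 'Kitchen'], suffixes=('_1', '_2'))
            .query('City_1 != City_2')
            .copy())  # assumption: explicit copy before adding a column

# One list [City_1, City_2] per row, then one output row per city
merged['City'] = merged[['City_1', 'City_2']].to_numpy().tolist()
output = (merged.drop(columns=['City_1', 'City_2', 'Detail_2'])
                .explode('City')
                .rename(columns={'Detail_1': 'Detail'})
                .reset_index(drop=True))  # assumption: fresh 0..n-1 index

print(output.shape)  # (16, 6)
```

Each of the 8 cross-city pairs becomes two rows, one per city, which is where the 16 rows come from.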