Home > Mobile >  How to partially match a list and write matched characters in a data frame in python
How to partially match a list and write matched characters in a data frame in python

Time:12-09

I have two data frames df1 and df2. I want to match those two so that df two values match to one column of df1 and show up in a row. Here is a sample data I made

import pandas as p`enter code here`d

# initialize list of lists
data = [["AA", 'ABC_111' ], ["BB", 'ABC_112'], ["CC", 'ABC_113']]
data1= [['ABC_111_12'], ['ABC_112_45'], ['ABC_112_89'],['ABC_113_06'], ['ABC_113_25'], ['ABC_113_89']]
result= [['AA' ,'ABC_111', 'ABC_111_12','ABC_111_19'], ['BB', 'ABC_112', "ABC_112_45",'ABC_112_89' ],
         ['CC','ABC_113', 'ABC_113_89','ABC_113_06', 'ABC_113_25', 'ABC_113_29']]

# Create the pandas DataFrame
df1= pd.DataFrame(data, columns = [0, 1])
df2= pd.DataFrame(data1, columns = [0])
result_df = pd.DataFrame(result, columns = [0, 1, 2, 3, 4,5])

# print dataframe.
print("df1: \n",df1)


    print("df2: \n",df2)
    
    print("expected_result: \n",result_df)


df1: 
     0        1
0  AA  ABC_111
1  BB  ABC_112
2  CC  ABC_113

df2: 
             0
0  ABC_111_12
1  ABC_112_45
2  ABC_112_89
3  ABC_113_06
4  ABC_113_25
5  ABC_113_89

So my expected result is something like this:

expected_result: 
     0        1           2           3           4           5
0  AA  ABC_111  ABC_111_12  ABC_111_19        None        None
1  BB  ABC_112  ABC_112_45  ABC_112_89        None        None
2  CC  ABC_113  ABC_113_89  ABC_113_06  ABC_113_25  ABC_113_29

CodePudding user response:

This works for the data provided.

  1. Split the data into the 'root' and the remaining value with rsplit()
  2. Use groupby and agg to put the remaining values in a list
  3. expand the lists to columns
  4. Concatenate with df1

Note: I think there's a better way to do 3 and 4, but this works

df2= pd.DataFrame(data1, columns = [0])
df2['cola'] = df2[0].str.rsplit('_', 1, expand=True)[0]
df3 = df2.groupby('cola').agg(list).reset_index()
df4 = pd.DataFrame(df3[0].tolist(), index= df3.index)
pd.concat([df1,df4], axis=1)

    0        1           0           1           2
0  AA  ABC_111  ABC_111_12        None        None
1  BB  ABC_112  ABC_112_45  ABC_112_89        None
2  CC  ABC_113  ABC_113_06  ABC_113_25  ABC_113_89

CodePudding user response:

I'm going to answer assuming the structure of the text entries are indicative of the real data you want to work with, as it is an important part of how I would go about solving this.

The most important thing here is isolating which bits of text in each value are static and which are variable.

If they are something like:

AAA_NNN_NN

A=alpha
N=numeric

The length of As or Ns can be variable, so long as the number of _ is static. If there are always 2 in the data1 list, then we are in business.

There are a few ways to go about this, and the ultimate structure would depend on how well you know the data and what shortcuts you can engineer into the solution, but a slow method would be to do some exact matching on string splits.

results = []
for i,j in data:
    tmp = [i, j]
    for k in data1:
        h = k[0].split("_")
        if h[1] in j:
            tmp.append(k[0])
        else:
            tmp.append(None)
    results.append(tmp)

for i in results:
    print(i)

The if h[1] in j: would see if '111' is in the string 'ABC_111' in the first case, which it is.

This is very crude, but it should give you an idea of exact matching within structured strings. The importance here is on the structure. There are many ways to match things, but a lot of it comes down to the data you are working with.

I hope this helps guide you to a solution.

CodePudding user response:

Try this using rsplit, groupby, cumcount, set_index and unstack:

dfm = (df2.assign(keystr=df2[0].str.rsplit('_',1).str[0])
          .merge(df1, left_on='keystr', right_on=1))
df_out = (dfm.set_index(['0_y', 
                        'keystr', 
                        dfm.groupby(['0_y','keystr'])['0_x'].cumcount()])['0_x']
            .unstack().reset_index())
print(df_out)

Output:

  0_y   keystr           0           1           2
0  AA  ABC_111  ABC_111_12         NaN         NaN
1  BB  ABC_112  ABC_112_45  ABC_112_89         NaN
2  CC  ABC_113  ABC_113_06  ABC_113_25  ABC_113_89
  • Related