I have two data frames df1 and df2. I want to match those two so that df two values match to one column of df1 and show up in a row. Here is a sample data I made
import pandas as p`enter code here`d
# initialize list of lists
data = [["AA", 'ABC_111' ], ["BB", 'ABC_112'], ["CC", 'ABC_113']]
data1= [['ABC_111_12'], ['ABC_112_45'], ['ABC_112_89'],['ABC_113_06'], ['ABC_113_25'], ['ABC_113_89']]
result= [['AA' ,'ABC_111', 'ABC_111_12','ABC_111_19'], ['BB', 'ABC_112', "ABC_112_45",'ABC_112_89' ],
['CC','ABC_113', 'ABC_113_89','ABC_113_06', 'ABC_113_25', 'ABC_113_29']]
# Create the pandas DataFrame
df1= pd.DataFrame(data, columns = [0, 1])
df2= pd.DataFrame(data1, columns = [0])
result_df = pd.DataFrame(result, columns = [0, 1, 2, 3, 4,5])
# print dataframe.
print("df1: \n",df1)
print("df2: \n",df2)
print("expected_result: \n",result_df)
df1:
0 1
0 AA ABC_111
1 BB ABC_112
2 CC ABC_113
df2:
0
0 ABC_111_12
1 ABC_112_45
2 ABC_112_89
3 ABC_113_06
4 ABC_113_25
5 ABC_113_89
So my expected result is something like this:
expected_result:
0 1 2 3 4 5
0 AA ABC_111 ABC_111_12 ABC_111_19 None None
1 BB ABC_112 ABC_112_45 ABC_112_89 None None
2 CC ABC_113 ABC_113_89 ABC_113_06 ABC_113_25 ABC_113_29
CodePudding user response:
This works for the data provided.
- Split the data into the 'root' and the remaining value with
rsplit()
- Use
groupby
andagg
to put the remaining values in a list - expand the lists to columns
- Concatenate with df1
Note: I think there's a better way to do 3 and 4, but this works
df2= pd.DataFrame(data1, columns = [0])
df2['cola'] = df2[0].str.rsplit('_', 1, expand=True)[0]
df3 = df2.groupby('cola').agg(list).reset_index()
df4 = pd.DataFrame(df3[0].tolist(), index= df3.index)
pd.concat([df1,df4], axis=1)
0 1 0 1 2
0 AA ABC_111 ABC_111_12 None None
1 BB ABC_112 ABC_112_45 ABC_112_89 None
2 CC ABC_113 ABC_113_06 ABC_113_25 ABC_113_89
CodePudding user response:
I'm going to answer assuming the structure of the text entries are indicative of the real data you want to work with, as it is an important part of how I would go about solving this.
The most important thing here is isolating which bits of text in each value are static and which are variable.
If they are something like:
AAA_NNN_NN
A=alpha
N=numeric
The length of As or Ns can be variable, so long as the number of _ is static. If there are always 2 in the data1 list, then we are in business.
There are a few ways to go about this, and the ultimate structure would depend on how well you know the data and what shortcuts you can engineer into the solution, but a slow method would be to do some exact matching on string splits.
results = []
for i,j in data:
tmp = [i, j]
for k in data1:
h = k[0].split("_")
if h[1] in j:
tmp.append(k[0])
else:
tmp.append(None)
results.append(tmp)
for i in results:
print(i)
The if h[1] in j:
would see if '111'
is in the string 'ABC_111'
in the first case, which it is.
This is very crude, but it should give you an idea of exact matching within structured strings. The importance here is on the structure. There are many ways to match things, but a lot of it comes down to the data you are working with.
I hope this helps guide you to a solution.
CodePudding user response:
Try this using rsplit
, groupby
, cumcount
, set_index
and unstack
:
dfm = (df2.assign(keystr=df2[0].str.rsplit('_',1).str[0])
.merge(df1, left_on='keystr', right_on=1))
df_out = (dfm.set_index(['0_y',
'keystr',
dfm.groupby(['0_y','keystr'])['0_x'].cumcount()])['0_x']
.unstack().reset_index())
print(df_out)
Output:
0_y keystr 0 1 2
0 AA ABC_111 ABC_111_12 NaN NaN
1 BB ABC_112 ABC_112_45 ABC_112_89 NaN
2 CC ABC_113 ABC_113_06 ABC_113_25 ABC_113_89