pandas compare first rows and make identical-CodePudding

I have two dfs.

df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])

df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))

which look like

df1
    0   1   2   3   4   5
0   b   a   z   z   a   r
1   d   o   g   s   s   s
2   z   x   v   f   z   x
3   a   n   y   t   h   i


df2
    0   1   2   3
0   b   a   a   r
1   m   a   a   r
2   c   a   t   s
3   $   %   &   *

I want to compare their first rows and make them identical by inserting new columns containing the character z to df2, so that df2 becomes

    0   1   2   3   4   5
0   b   a   z   z   a   r
1   m   a   z   z   a   r
2   c   a   z   z   t   s
3   $   %   z   z   &   *

An additional example:

df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])

df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))

df3
    0   1   2   3   4   5   6   7   8
0   a   a   z   z   b   b   z   c   c
1   b   b   b   b   b   b   b   b   b
2   c   c   c   c   c   c   c   c   c
3   d   d   d   d   d   d   d   d   d

df4 
    0   1   2   3   4   5
0   a   a   b   b   c   c
1   1   1   1   1   1   1
2   2   2   2   2   2   2
3   3   3   3   3   3   3

You can see, an important relationship between the first rows of the two dataframes: they will eventually become the same when character z are added to the later dataframe (i.e. df2 and df4), so that the expected output for this example is:

    0   1   2   3   4   5   6   7   8
0   a   a   z   z   b   b   z   c   c
1   1   1   z   z   1   1   z   1   1
2   2   2   z   z   2   2   z   2   2
3   3   3   z   z   3   3   z   3   3

Any idea how to do that?

CodePudding user response：

Because in first rows are duplicated values are create MultiIndex with first rows and GroupBy.cumcount for both DataFrames:

a = df1.iloc[[0]].T
df1.columns = [a[0], a.groupby(a[0]).cumcount()]

b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]

print (df1)
0  b  a  z     a  r
   0  0  0  1  1  0
0  b  a  z  z  a  r
1  d  o  g  s  s  s
2  z  x  v  f  z  x
3  a  n  y  t  h  i

print (df2)
0  b  a     r
   0  0  1  0
0  b  a  a  r
1  m  a  a  r
2  c  a  t  s
3  $  %  &  *

And then is used DataFrame.reindex with replace missing values by first row of df1:

df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print (df)
0  b  a  z     a  r
   0  0  0  1  1  0
0  b  a  z  z  a  r
1  m  a  z  z  a  r
2  c  a  z  z  t  s
3  $  %  z  z  &  *

Last set range to columns:

df.columns = range(len(df.columns))
print (df)
   0  1  2  3  4  5
0  b  a  z  z  a  r
1  m  a  z  z  a  r
2  c  a  z  z  t  s
3  $  %  z  z  &  *

CodePudding user response：

Check where to add:

list(difflib.ndiff(df2[0][0], df1[0][0]))
['  b', '  a', '  z', '  z', '  a', '  r']

Add manually

df2[0].str.replace('(.){2}', '\\1zz', regex = True).str.split('(?<=\\S)(?=\\S)', expand = True)
Out[1557]: 
   0  1  2  3  4  5
0  a  z  z  r  z  z
1  a  z  z  r  z  z
2  a  z  z  s  z  z
3  %  z  z  *  z  z