I have two dfs.
df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])
df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))
which look like
df1
0 1 2 3 4 5
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
df2
0 1 2 3
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
I want to compare their first rows and make them identical by inserting new columns containing the character z
to df2
, so that df2
becomes
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
An additional example:
df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])
df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))
df3
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 b b b b b b b b b
2 c c c c c c c c c
3 d d d d d d d d d
df4
0 1 2 3 4 5
0 a a b b c c
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
You can see, an important relationship between the first rows of the two dataframes: they will eventually become the same when character z
are added to the later dataframe (i.e. df2
and df4
), so that the expected output for this example is:
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 1 1 z z 1 1 z 1 1
2 2 2 z z 2 2 z 2 2
3 3 3 z z 3 3 z 3 3
Any idea how to do that?
CodePudding user response:
Because in first rows are duplicated values are create MultiIndex
with first rows and GroupBy.cumcount
for both DataFrames:
a = df1.iloc[[0]].T
df1.columns = [a[0], a.groupby(a[0]).cumcount()]
b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]
print (df1)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
print (df2)
0 b a r
0 0 1 0
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
And then is used DataFrame.reindex
with replace missing values by first row of df1
:
df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print (df)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Last set range to columns:
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
CodePudding user response:
Check where to add:
list(difflib.ndiff(df2[0][0], df1[0][0]))
[' b', ' a', ' z', ' z', ' a', ' r']
Add manually
df2[0].str.replace('(.){2}', '\\1zz', regex = True).str.split('(?<=\\S)(?=\\S)', expand = True)
Out[1557]:
0 1 2 3 4 5
0 a z z r z z
1 a z z r z z
2 a z z s z z
3 % z z * z z