I have two columns of two different dataframes. The following chunks are the first 5 rows, but each column is much longer:
A = pd.DataFrame(['30-34', '20-24', '20-24', '15-19', '00-04'])
and
B = pd.DataFrame(['6','4', '4', '3', '0'])
I want to check whether both columns coincide, considering that
0 represents 00-04;
1 represents 05-09;
2 represents 10-14;
3 represents 15-19;
4 represents 20-24;
5 represents 25-29;
and 6 represents 30-34.
The desired output would be the number of non-matching elements. In the sample given, the desired output is "0", because the first 5 values of both columns are matching. I would give an approach that I tried but I have absolutely no idea.
CodePudding user response:
IIUC, you have ranges, 5 by 5, and you want to match them to their integer division.
(B.astype(int).values == A[0].str.split('-', expand=True).astype(int)//5).all(axis=1)
output:
0 True
1 True
2 True
3 True
4 True
Check if the columns coincide:
(B.astype(int).values ==
A[0].str.split('-', expand=True).astype(int)//5
).all(axis=1).all()
output: True
Intermediate steps:
# split on "-"
>>> A[0].str.split('-', expand=True)
0 1
0 30 34
1 20 24
2 20 24
3 15 19
4 00 04
# get integer division
>>> A[0].str.split('-', expand=True).astype(int)//5
0 1
0 6 6
1 4 4
2 4 4
3 3 3
4 0 0
# check if equals B
>>> B.astype(int).values == A[0].str.split('-', expand=True).astype(int)//5
0 1
0 True True
1 True True
2 True True
3 True True
4 True True
CodePudding user response:
You can map the strings to integers with map
, then compare with the other dataframe and count the number of non-matching items:
import pandas as pd
import io
data ='''0 30-34
1 20-24
2 20-24
3 15-19
4 00-04'''
data1 = '''0 6
1 4
2 4
3 3
4 0'''
df = pd.read_csv(io.StringIO(data), sep='\s ', names=['idx', 'string'])
df1 = pd.read_csv(io.StringIO(data1), sep='\s ', names=['idx', 'value'])
df['value'] = df['string'].map({'00-04': 0, '05-09':1, '10-14':2, '15-19':3, '20-24':4, '25-30':5, '30-34':6})
sum(df['value'] != df1['value'])