Filter rows including Chinese characters from one column with Pandas [duplicate]

Given the following test data:

   id object
0   1    ABS
1   2     B2
2   3    H型钢
3   4     IP
4   5     12
5   6     豆2
6   7      φ

Or:

[{'id': 1, 'object': 'ABS'},
 {'id': 2, 'object': 'B2'},
 {'id': 3, 'object': 'H型钢'},
 {'id': 4, 'object': 'IP'},
 {'id': 5, 'object': '12'},
 {'id': 6, 'object': '豆2'},
 {'id': 7, 'object': 'φ'}]
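
To reproduce the table, the records above can be loaded straight into a DataFrame. This is only a minimal sketch: the variable name records is illustrative, and df is the name the answers below assume.

```python
import pandas as pd

# Sample records copied from the question.
records = [
    {"id": 1, "object": "ABS"},
    {"id": 2, "object": "B2"},
    {"id": 3, "object": "H型钢"},
    {"id": 4, "object": "IP"},
    {"id": 5, "object": "12"},
    {"id": 6, "object": "豆2"},
    {"id": 7, "object": "φ"},
]

# Each dict becomes one row; the keys become the column names.
df = pd.DataFrame(records)
print(df)
```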

How can I filter for the rows whose object column contains Chinese characters?

The expected result:

   id object
0   3    H型钢
1   6     豆2

CodePudding user response：

You can use the third-party regex module, which understands Unicode property classes (for Chinese, or CJK ideographs in general, the \p{IsHan} script property):

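import regex  # pip install regex; the stdlib re module does not support \p{...}

# Matches any single Han (CJK ideograph) character
reg = regex.compile(r'\p{IsHan}')
df[~df['object'].apply(lambda x: bool(reg.search(x)))]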

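Output (the ~ inverts the mask, so rows containing Chinese characters are dropped; remove the ~ to keep only those rows, as in the expected result):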

   id object
0   1    ABS
1   2     B2
3   4     IP
4   5     12
6   7      φ
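
If installing the regex package is not an option, a rough standard-library substitute is to check each character's Unicode name. Note that this is a different technique from \p{IsHan}: it is only a sketch, the helper name contains_han is hypothetical, and it recognises only characters named "CJK UNIFIED IDEOGRAPH-...", so compatibility ideographs and radicals are not covered.

```python
import unicodedata


def contains_han(text):
    """Return True if any character is named as a CJK unified ideograph."""
    return any(
        # The second argument avoids a ValueError for unnamed code points.
        unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
        for ch in text
    )


# Values of the object column from the question.
values = ["ABS", "B2", "H型钢", "IP", "12", "豆2", "φ"]
flags = [contains_han(v) for v in values]
print(flags)
```

With a DataFrame, df[df['object'].apply(contains_han)] would keep the matching rows, and prefixing the mask with ~ would drop them instead.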

CodePudding user response：

Try matching on the Unicode range of the common CJK ideographs (U+4E00 to U+9FFF):

>>> df.loc[df['object'].str.contains('[\u4e00-\u9fff]+')]
   id object
2   3    H型钢
5   6     豆2
>>>
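
The output above keeps the original index labels (2 and 5), whereas the expected result in the question is numbered 0 and 1. A minimal sketch of closing that gap follows. The constant name HAN_PATTERN is illustrative, and adding the Extension A block (U+3400 to U+4DBF) is an assumption about what should count as Chinese; rarer ideographs outside these two blocks are still missed.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "object": ["ABS", "B2", "H型钢", "IP", "12", "豆2", "φ"],
})

# CJK Unified Ideographs Extension A plus the basic block (assumed coverage).
HAN_PATTERN = "[\u3400-\u4dbf\u4e00-\u9fff]"

# na=False treats missing values as non-matching instead of propagating NaN.
mask = df["object"].str.contains(HAN_PATTERN, regex=True, na=False)

# reset_index(drop=True) renumbers the surviving rows from 0.
result = df[mask].reset_index(drop=True)
print(result)
```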