Given a test data as follows:
id object
0 1 ABS
1 2 B2
2 3 H型钢
3 4 IP
4 5 12
5 6 豆2
6 7 φ
Or:
[{'id': 1, 'object': 'ABS'},
{'id': 2, 'object': 'B2'},
{'id': 3, 'object': 'H型钢'},
{'id': 4, 'object': 'IP'},
{'id': 5, 'object': '12'},
{'id': 6, 'object': '豆2'},
{'id': 7, 'object': 'φ'}]
How could I filter out rows which which contains chinese characters in object
column?
The expected result:
id object
0 3 H型钢
1 6 豆2
CodePudding user response:
You can use the regex
module that understands the unicode block definitions (in the case of Chinese (or CJK) the \p{IsHan}
block):
import regex
reg = regex.compile(r'.*\p{IsHan}')
df[~df['object'].apply(lambda x: bool(reg.match(x)))]
output:
id object
0 1 ABS
1 2 B2
3 4 IP
4 5 12
6 7 φ
CodePudding user response:
Try with unicode characters:
>>> df.loc[df['object'].str.contains('[\u4e00-\u9fff] ')]
id object
2 3 H型钢
5 6 豆2
>>>