Fixing json.loads errors when parsing Chinese text

json.loads raises an exception:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x** in position **: invalid continuation byte

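This usually happens because the data contains Chinese text in an encoding that does not match the default decoding (UTF-8). In China, the most common non-UTF-8 encoding is GB2312. (If you know the data contains Japanese, try decoding it as Shift_JIS instead of GB2312, and so on.) If the data still cannot be decoded, it is sometimes acceptable to drop the undecodable bytes, for example when they are only text inside a comment.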
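The mismatch can be reproduced in a few lines. This is a minimal sketch; the sample payload is made up for illustration and is assumed to come from a GBK-encoded source:

```python
import json

# Hypothetical payload: JSON text containing Chinese, encoded as GBK instead of UTF-8.
raw = '{"name": "中文"}'.encode("gbk")

try:
    json.loads(raw)  # bytes input is decoded as UTF-8 here
    failed = False
except UnicodeDecodeError:
    failed = True  # the GBK bytes are not valid UTF-8

# Decoding with the encoding that was actually used lets the parse succeed.
obj = json.loads(raw.decode("gbk"))
```

The parse only fails because json.loads is handed raw bytes; once the bytes are decoded with the right codec, the resulting str parses normally.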

import re
import json


def omit_unascii(data):
    # Delete every run of non-ASCII bytes (data must be bytes).
    return re.sub(rb"[^\x00-\x7F]+", b"", data)


def json_loads(msg):
    # msg is expected to be bytes.
    try:
        # First try a normal parse (bytes are decoded as UTF-8).
        return json.loads(msg)
    except ValueError:
        pass
    try:
        # Next, decode as GBK (a superset of GB2312) and parse again.
        return json.loads(msg.decode("gbk", "ignore"))
    except ValueError:
        pass
    # Last resort: strip all non-ASCII bytes and parse; any error propagates.
    return json.loads(omit_unascii(msg))
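
If the data might be in something other than GB2312 (Shift_JIS for Japanese, for instance), the same fallback idea can be written as a loop over candidate encodings. The following is only a sketch: the name loads_with_fallback and the default encoding list are assumptions made for illustration, not part of the code above.

```python
import json


def loads_with_fallback(data, encodings=("utf-8", "gbk", "shift_jis")):
    """Try each candidate encoding in order; return the first successful parse."""
    last_error = None
    for encoding in encodings:
        try:
            return json.loads(data.decode(encoding))
        except ValueError as exc:
            # UnicodeDecodeError and json.JSONDecodeError both derive from ValueError.
            last_error = exc
    raise ValueError("no candidate encoding produced valid JSON") from last_error


# Hypothetical payloads for illustration.
gbk_obj = loads_with_fallback('{"name": "中文"}'.encode("gbk"))
sjis_obj = loads_with_fallback(
    '{"name": "日本語"}'.encode("shift_jis"),
    encodings=("utf-8", "shift_jis"),
)
```

The order of the list matters: bytes written in one legacy encoding can often be decoded without error by another and come out as garbled text, so put the encoding you consider most likely first.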