UTF-8 on Windows with Python
I have an HTML file that I need to read, parse, etc. It is encoded in Unicode (I saw that in Notepad), but when I tried
    infile = open("path", "r")
    infile.read()
it fails with the famous error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position xx: character maps to <undefined>
So, as a test, I copied and pasted the contents of the file into a new one, saved it as UTF-8, and tried to open it with codecs:
    import codecs
    infile = codecs.open("path", "r", encoding="utf-8")
    outputstream = infile.read()
But I got this error message:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
I don't understand, because I created the file in UTF-8.
UnicodeEncodeError suggests that the code fails while encoding Unicode text to bytes, i.e., your actual code tries to print to the Windows console. See "Python, Unicode, and the Windows console".
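As a rough illustration of that point, the sketch below reproduces the failure without touching any file: the error appears only when an already decoded string is encoded for a code page that lacks the character. The code page cp1252 and the helper name console_safe are assumptions made for the example, not something taken from the question.

```python
def console_safe(text, encoding="cp1252"):
    """Return text with characters missing from the code page shown as escapes."""
    # Encode with escapes for unsupported characters, then decode back to str.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


sample = "\ufeffhello"  # U+FEFF (the BOM character) followed by plain ASCII

try:
    # Roughly what print() has to do when the console uses cp1252.
    sample.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

safe = console_safe(sample)
print(failed)
print(safe)
```

Reading the file is therefore not what raises the exception; the print (or any other implicit encode to the console code page) is.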
That linked question addresses the UnicodeEncodeError. The next issue is to find out which character encoding is used by the text in your "path" file. If notepad.exe shows the text correctly, it means the text is either encoded using locale.getpreferredencoding(False) (something like cp1252 on Windows) or the file has a BOM.
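One way to tell the two cases apart is to look at the first bytes of the file. This is a minimal sketch, assuming it is enough to recognise the standard UTF-8, UTF-16 and UTF-32 BOMs; the names sniff_bom and guess_encoding are hypothetical helpers, and the locale fallback is only a guess, not a detection.

```python
import codecs
import locale

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def guess_encoding(raw):
    """Prefer a BOM; otherwise fall back to the locale's preferred encoding."""
    return sniff_bom(raw) or locale.getpreferredencoding(False)


print(sniff_bom(codecs.BOM_UTF8 + b"<html></html>"))
print(sniff_bom(b"<html></html>"))
```

For a real file, the first four bytes read in binary mode (open('path', 'rb')) are enough to pass to sniff_bom.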
If you are sure that the encoding is utf-8, pass it to open() directly. Don't use codecs.open():
    with open('path', encoding='utf-8') as file:
        html = file.read()
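The u'\ufeff' in the second error message is a byte order mark, which some editors write at the start of a file saved as UTF-8. The plain utf-8 codec keeps it as the first character of the decoded text, while Python's utf-8-sig codec drops it. The sketch below compares the two on a temporary file; the file content is made up for the example.

```python
import os
import tempfile

# Create a small file that starts with the UTF-8 BOM bytes (EF BB BF).
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf<html></html>")
    path = tmp.name

try:
    # Plain utf-8 keeps the BOM as the character U+FEFF.
    with open(path, encoding="utf-8") as f:
        with_bom = f.read()
    # utf-8-sig strips a leading BOM if one is present.
    with open(path, encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(with_bom.startswith("\ufeff"))
print(without_bom)
```

utf-8-sig also reads files that have no BOM, so it is a reasonable choice when it is unknown whether the editor added one.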
Sometimes the input may contain text encoded using multiple (inconsistent) encodings, e.g., smart quotes may be encoded using cp1252 while the rest of the HTML is utf-8. You could fix that using bs4.UnicodeDammit. See also the related question on finding a good way to get the charset/encoding of an HTTP response in Python.