UTF-8 on Windows in Python


I have an HTML file that I want to read, parse, etc. It is encoded as Unicode (I saw that in Notepad), but when I tried:

infile = open("path", "r")
infile.read()

it fails with the famous error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position xx: character maps to <undefined>

So, as a test, I copied and pasted the content of the file into a new one, saved it as UTF-8, and then tried to open it with codecs:

import codecs
infile = codecs.open("path", "r", encoding="utf-8")
outputstream = infile.read()

but I get this error message:

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

I don't understand it, because I created the file in UTF-8.

UnicodeEncodeError suggests that the code fails while encoding Unicode text to bytes, i.e., your actual code tries to print to the Windows console. See Python, Unicode, and the Windows console.
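To make the distinction concrete, here is a minimal sketch, assuming Python 3 and using cp437 as a stand-in for whatever code page your console uses. The sample text and the helper name console_safe are made up for illustration. It shows that the text itself is fine and that the error only appears when it is encoded to the legacy code page:

```python
import sys

# Made-up sample: decoded text that starts with a BOM character.
text = "\ufeff<html>caf\u00e9</html>"

# The failure happens at encoding time, e.g. when print() encodes the text
# for a console that uses a legacy code page (cp437 is assumed here).
try:
    text.encode("cp437")
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)


def console_safe(s, encoding=None):
    """Hypothetical helper: escape characters the target encoding lacks."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return s.encode(encoding, errors="backslashreplace").decode(encoding)


# Characters missing from cp437 become backslash escapes instead of errors.
safe = console_safe(text, "cp437")
```

Escaping like this only hides the symptom when printing; it does not change how the file should be read.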


The link above fixes the UnicodeEncodeError. The next issue is to find out what character encoding is used by the text in your "path" file. If notepad.exe shows the text correctly, it means that it is either encoded using locale.getpreferredencoding(False) (something like cp1252 on your Windows) or the file has a BOM.
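One way to check those two possibilities is to look at the first bytes of the file. The following is only a sketch: guess_encoding is a hypothetical helper, not a library function, and it cannot tell you the encoding of a file that has no BOM; it just falls back to the locale default in that case.

```python
import codecs
import locale

# Known BOMs, UTF-32 first so that a UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(head):
    """Guess a codec name from the first bytes of a file (hypothetical helper)."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # No BOM found: assume the locale's preferred encoding.
    return locale.getpreferredencoding(False)
```

You could call it on the result of open('path', 'rb').read(4) and pass the returned name as the encoding argument when opening the file in text mode.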

If you are sure that the encoding is UTF-8, then pass it to open() directly. Don't use codecs.open():

with open('path', encoding='utf-8') as file:
    html = file.read()
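If the file was saved as UTF-8 with a BOM, the plain utf-8 codec keeps the BOM as a leading u'\ufeff' character, which is the character named in the second error message. The standard utf-8-sig codec drops it when present. A small sketch, assuming Python 3 and a temporary sample file created only for the demonstration:

```python
import codecs
import os
import tempfile

# Create a small sample file that starts with a UTF-8 BOM.
fd, sample_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "<p>caf\u00e9</p>".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the text.
with open(sample_path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig strips a leading BOM if there is one.
with open(sample_path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(sample_path)
```

utf-8-sig also reads files that have no BOM, so it is a reasonable choice when you are not sure whether the editor wrote one.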

Sometimes, the input may contain text encoded using multiple (inconsistent) encodings, e.g., smart quotes may be encoded using cp1252 while the rest of the HTML is UTF-8 -- you could fix it using bs4.UnicodeDammit. See A good way to get the charset/encoding of an HTTP response in Python.

