How to correct TypeError: Unicode-objects must be encoded before hashing?

Question

I have this error:

Traceback (most recent call last):
  File "python_md5_cracker.py", line 27, in <module>
  m.update(line)
TypeError: Unicode-objects must be encoded before hashing

when I try to execute this code in Python 3.2.2:

import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides?  ")
wordlist = input("What is your wordlist?  (Enter the file name)  ")
try:
  hashdocument = open(hash_file, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  hash = hashdocument.readline()
  hash = hash.replace("\n", "")

try:
  wordlistfile = open(wordlist, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  pass
for line in wordlistfile:
  # Flush the buffer (this caused a massive problem when placed 
  # at the beginning of the script, because the buffer kept getting
  # overwritten, thus comparing incorrect hashes)
  m = hashlib.md5()
  line = line.replace("\n", "")
  m.update(line)
  word_hash = m.hexdigest()
  if word_hash == hash:
    print("Collision! The word corresponding to the given hash is", line)
    input()
    sys.exit()

print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()

I found opening a file with 'rb' helped my case. – dlamblin Nov 28 '17 at 07:06

cwallenpoole · Accepted Answer · 2021-01-14T15:19:09.313

367

It is probably looking for a character encoding from wordlistfile.

wordlistfile = open(wordlist,"r",encoding='utf-8')

Or, if you're working on a line-by-line basis:

line.encode('utf-8')

EDIT

Per the comment below and this answer.

My answer above assumes that the desired output is a str from the wordlist file. If you are comfortable working in bytes, then you're better off using open(wordlist, "rb"). But it is important to remember that your hash file should NOT use rb if you are comparing it to the output of hexdigest. hashlib.md5(value).hexdigest() outputs a str, and that cannot be directly compared with a bytes object: 'abc' != b'abc'. (There's a lot more to this topic, but I don't have the time at the moment.)

It should also be noted that this line:

line.replace("\n", "")

Should probably be

line.strip()

That will work for both bytes and str (note that it also removes any other leading and trailing whitespace). But if you decide to simply convert to bytes, then you can change the line to:

line.replace(b"\n", b"")
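A minimal sketch of the two approaches described above, using in-memory lists in place of the wordlist file. The sample words are made up for illustration; the point is only that both routes end in the same str digest.

```python
import hashlib

# Hypothetical wordlist lines, as they would arrive from the file
text_lines = ["apple\n", "secret\n"]      # as read with open(wordlist, "r")
byte_lines = [b"apple\n", b"secret\n"]    # as read with open(wordlist, "rb")

# Text mode: strip the newline, then encode to bytes before hashing
text_hashes = [hashlib.md5(line.strip().encode("utf-8")).hexdigest()
               for line in text_lines]

# Binary mode: the lines are already bytes, so no encoding step is needed
byte_hashes = [hashlib.md5(line.strip()).hexdigest() for line in byte_lines]

# Both routes produce the same hex digests, and hexdigest() returns a str
print(text_hashes == byte_hashes)   # True
print(type(text_hashes[0]))
```

Either way, the value read from the hash file should stay a str so that it can be compared with the result of hexdigest().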

edited Jan 14 '21 at 15:19

answered Sep 28 '11 at 15:10

cwallenpoole

`open(wordlist,"r",encoding='utf-8')`: why use open with a specific encoding? The encoding argument specifies the codec used for decoding; without this option, it uses a platform-dependent encoding. – Tanky Woo Jan 19 '16 at 01:05
The first half of this is flat wrong, and it's shocking it got up-voted as high as it did. Specifying an `encoding` explicitly just changes how it decodes the bytes on disk to get a `str` (a text type storing arbitrary Unicode), but it would decode to `str` without that, and the problem is using `str` in the first place. The `line.encode('utf-8')` *undoes* that mistaken decoding, but the OP should just be opening the file in `'rb'` mode in the first place (with no encoding) so `line` is a `bytes` object in the first place (a few trivial changes needed to match, e.g. in `.replace("\n", '')`). – ShadowRanger Jan 13 '21 at 04:21
@ShadowRanger And if the OP *wants* a `str`? I added a bit to the answer, but my original reply was short, sweet, and immediately available. It also happened to be the right answer for a project I was working on when I wrote the above reply, so `¯\_(ツ)_/¯` – cwallenpoole Jan 14 '21 at 15:20

score 152 · Answer 2 · edited May 05 '17 at 20:54

You have to specify an encoding such as utf-8. Try this easy way.

This example hashes a random 256-bit number using the SHA-256 algorithm (the output differs on each run):

>>> import hashlib
>>> import random
>>> hashlib.sha256(str(random.getrandbits(256)).encode('utf-8')).hexdigest()
'cd183a211ed2434eac4f31b317c573c50e6c24e3a28b82ddcb0bf8bedf387a9f'

edited May 05 '17 at 20:54

Community

answered Mar 19 '14 at 12:03

Jay Patel

score 33 · Answer 3 · answered Dec 16 '18 at 14:15

import hashlib
string_to_hash = '123'
hash_object = hashlib.sha256(str(string_to_hash).encode('utf-8'))
print('Hash', hash_object.hexdigest())

answered Dec 16 '18 at 14:15

Sabyasachi

hashlib.sha256 has always expected bytes. In Python 2, str was a byte string, so just passing string_to_hash used to work fine. In Python 3, however, str is Unicode text and bytes is a separate type. So when we pass just string_to_hash (which is a str), it raises an error stating that the value must be encoded first. – kundan Oct 29 '20 at 20:17

score 19 · Answer 4 · answered Sep 11 '17 at 09:09

To store the password (PY3):

import hashlib, os
password_salt = os.urandom(32).hex()
password = '12345'

hash = hashlib.sha512()
hash.update(('%s%s' % (password_salt, password)).encode('utf-8'))
password_hash = hash.hexdigest()

answered Sep 11 '17 at 09:09

Khắc Nghĩa Từ

This line makes the password impossible to use: password_salt = os.urandom(32).hex(). It should be a fixed, known value, though it can be kept secret on the server. Please correct me or adapt it to your code. – Yash Dec 12 '18 at 15:51
I agree with @Yash. You either have a single salt you use for every hash (not the best), or, if you generate a random salt for each hash, you must store it with the hash to use again later for comparison. – Carson Evans Jan 09 '19 at 18:14
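A minimal sketch of what the comments describe: generate a random salt per password, keep it next to the hash, and reuse it when checking a login attempt. The function names hash_password and verify_password are hypothetical and not part of the answer; the salted SHA-512 scheme is carried over from the answer only for illustration.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest) as hex strings; make a fresh random salt if none is given."""
    if salt is None:
        salt = os.urandom(32).hex()
    # The str must be encoded to bytes before it is hashed
    digest = hashlib.sha512((salt + password).encode("utf-8")).hexdigest()
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

# Store both values; the salt is needed again at verification time
salt, stored_digest = hash_password("12345")

print(verify_password("12345", salt, stored_digest))   # True
print(verify_password("wrong", salt, stored_digest))   # False
```

For real password storage, a deliberately slow key derivation function such as hashlib.pbkdf2_hmac is generally preferred over a single hash pass.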

score 17 · Answer 5 · answered Sep 28 '11 at 15:09

The error already says what you have to do. MD5 operates on bytes, so you have to encode the Unicode string into bytes, e.g. with line.encode('utf-8').
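A short sketch that reproduces the error and then applies the fix. The sample word is arbitrary, and the exact wording of the TypeError message depends on the Python version.

```python
import hashlib

m = hashlib.md5()

# Passing a str raises TypeError on Python 3
error = None
try:
    m.update("password")
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# Encoding the str to bytes first is accepted
m.update("password".encode("utf-8"))
digest = m.hexdigest()
print(digest)   # 5f4dcc3b5aa765d61d8327deb882cf99
```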

answered Sep 28 '11 at 15:09

Cat Plus Plus

score 13 · Answer 6 · edited May 23 '17 at 11:47

Please take a look first at that answer.

Now, the error message is clear: you can only use bytes, not Python strings (what used to be unicode in Python < 3), so you have to encode the strings with your preferred encoding: utf-32, utf-16, utf-8 or even one of the restricted 8-bit encodings (what some might call codepages).

The bytes in your wordlist file are being automatically decoded to Unicode by Python 3 as you read from the file. I suggest you do:

m.update(line.encode(wordlistfile.encoding))

so that the encoded data pushed to the md5 algorithm are encoded exactly like the underlying file.
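To illustrate why the choice of encoding matters, here is a small sketch with a made-up word containing a non-ASCII character. The same text encoded two different ways yields different bytes and therefore different digests, while pure ASCII text is unaffected by the choice between these two encodings.

```python
import hashlib

word = "café"   # hypothetical wordlist entry with a non-ASCII character

utf8_digest = hashlib.md5(word.encode("utf-8")).hexdigest()
latin1_digest = hashlib.md5(word.encode("latin-1")).hexdigest()

# "é" is two bytes in UTF-8 but one byte in Latin-1, so the digests differ
print(utf8_digest == latin1_digest)   # False

# ASCII-only text encodes to the same bytes in both encodings
ascii_word = "cafe"
ascii_same = (hashlib.md5(ascii_word.encode("utf-8")).hexdigest()
              == hashlib.md5(ascii_word.encode("latin-1")).hexdigest())
print(ascii_same)   # True
```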

edited May 23 '17 at 11:47

Community

answered Oct 15 '11 at 14:14

tzot

Why decode only to reencode when you could just process the file in binary mode and deal with `bytes` the whole way? – ShadowRanger Jan 13 '21 at 04:29
@ShadowRanger for this simple case (just reading lines and stripping the b'\n' at the end of each line) your suggestion is correct and adequate. – tzot Jan 13 '21 at 15:32

score 12 · Answer 7 · answered Jan 29 '19 at 00:38

Encoding this line fixed it for me:

m.update(line.encode('utf-8'))

answered Jan 29 '19 at 00:38

Mike Cash

score 10 · Answer 8 · edited May 18 '15 at 10:54

You could open the file in binary mode:

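import hashlib

# Text mode is fine here: hexdigest() returns a str to compare against
with open(hash_file) as file:
    control_hash = file.readline().rstrip("\n")

# Binary mode: each line is already bytes, so md5() accepts it as-is
wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
    word = line.rstrip(b'\n\r')
    if hashlib.md5(word).hexdigest() == control_hash:
        print("Collision! The word is", word)
        break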
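A self-contained sketch of the binary-mode approach, assuming a throwaway wordlist written to a temporary file. The function name crack_md5 and the sample words are hypothetical, chosen only so the example runs on its own.

```python
import hashlib
import os
import tempfile

def crack_md5(target_hex, wordlist_path):
    """Return the first word (as bytes) whose MD5 hex digest equals target_hex, else None."""
    with open(wordlist_path, "rb") as wordlistfile:
        for line in wordlistfile:
            word = line.rstrip(b"\r\n")   # handles both \n and \r\n line endings
            if hashlib.md5(word).hexdigest() == target_hex:
                return word
    return None

# Write a small sample wordlist to a temporary file
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"apple\r\npassword\nzebra\n")
    path = tmp.name

try:
    found = crack_md5("5f4dcc3b5aa765d61d8327deb882cf99", path)
    missing = crack_md5("0" * 32, path)
finally:
    os.remove(path)

print(found)     # b'password'
print(missing)   # None
```

The matched word comes back as bytes; call .decode() with the file's encoding if a str is needed for display.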

edited May 18 '15 at 10:54

NorthCat

answered Mar 25 '14 at 19:36

jfs

I am absolutely amazed I had to scroll down this far to find the first sane answer. Unless there is some reason to think the `wordlist` file is in the wrong encoding (and must therefore be decoded from the wrong encoding, then encoded with the correct encoding for hashing) this is by far the best solution, avoiding pointless decoding and reencoding in favor of just processing `bytes` (the source of the error in the OP's code). – ShadowRanger Jan 13 '21 at 04:25

score 3 · Answer 9 · answered Apr 05 '20 at 07:36

If it's a string literal, prefix it with b or B to make it a bytes object, e.g.:

variable = b"This is a variable"

or

variable2 = B"This is also a variable"
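A brief sketch of how a bytes literal relates to encoding a str. Bytes literals may only contain ASCII characters, so text with other characters, or text held in a variable, still needs .encode(). The sample strings are arbitrary.

```python
import hashlib

literal = b"This is a variable"
encoded = "This is a variable".encode("utf-8")

# For ASCII text, the bytes literal and the encoded str are identical
print(literal == encoded)   # True

# Both can be hashed directly and give the same digest
same_digest = (hashlib.md5(literal).hexdigest()
               == hashlib.md5(encoded).hexdigest())
print(same_digest)          # True

# Non-ASCII text cannot be written as a bytes literal; encode it instead
non_ascii = "naïve".encode("utf-8")
print(type(non_ascii))
```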

answered Apr 05 '20 at 07:36

SBimochan

score -5 · Answer 10 · answered Jun 23 '18 at 18:01

This program is a bug-free, enhanced version of the above MD5 cracker. It reads a file containing a list of hashed passwords and checks each one against the hashes of the words in an English dictionary word list. Hope it is helpful.

I downloaded the English dictionary from the following link https://github.com/dwyl/english-words

# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words 

import hashlib, sys

# Raw strings keep the backslashes in these Windows paths literal
hash_file = r'exercise\hashed.txt'
wordlist = r'data_sets\english_dictionary\words.txt'

try:
    hashdocument = open(hash_file,'r')
except IOError:
    print('Invalid file.')
    sys.exit()
else:
    count = 0
    for hash in hashdocument:
        hash = hash.rstrip('\n')
        print(hash)
        i = 0
        with open(wordlist,'r') as wordlistfile:
            for word in wordlistfile:
                m = hashlib.md5()
                word = word.rstrip('\n')            
                m.update(word.encode('utf-8'))
                word_hash = m.hexdigest()
                if word_hash==hash:
                    print('The word, hash combination is ' + word + ',' + hash)
                    count += 1
                    break
                i += 1
        print('Iteration is ' + str(i))
    if count == 0:
        print('The hash given does not correspond to any supplied word in the wordlist.')
    else:
        print('Total passwords identified is: ' + str(count))
sys.exit()
