I'm using urllib2 to get the sourcecode of a website which I then filter with regex for a bas64 encoded string, and iterate over it passing the matches to a function:
def Base64Decoder(match): curMatch = match.group().decode('utf-8', errors='ignore') decoded = base64.b64decode(curMatch) return decoded
When I print out the returned value of Base64Decoder some chars are wrong, how do I filter them out correctly? I don't want to see gibberish chars like the following:
The website's encoding is utf-8 but the returned value of urllib seems to be unicode?
Edit: the sourcecode looks like this (raw)
<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode("MzEuMTMuMTcuMjE0"))</script></td>
and the filtered string is Base64.decode("MzEuMTMuMTcuMjE0
striped to MzEuMTMuMTcuMjE0