Decoding/Encoding: how to ignore possible errors and remove/replace wrong chars?

Question

I'm using urllib2 to get the source code of a website, which I then filter with a regex for a base64-encoded string. I iterate over the matches, passing each one to a function:

def Base64Decoder(match):  
    curMatch = match.group().decode('utf-8', errors='ignore')  
    decoded = base64.b64decode(curMatch)   
    return decoded

When I print the value returned by Base64Decoder, some chars are wrong. How do I filter them out correctly? I don't want to see gibberish chars like the following:

[Screenshots: Cygwin, Linux]

The website's encoding is UTF-8, but the value returned by urllib2 seems to be unicode?

Edit: the source code looks like this (raw):

<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode("MzEuMTMuMTcuMjE0"))</script></td>

and the filtered string is Base64.decode("MzEuMTMuMTcuMjE0, stripped to MzEuMTMuMTcuMjE0.

The Windows console does not by default show UTF-8 encoded characters very well. – Some programmer dude, Sep 18 '13 at 07:30
It's not the Windows console, it's Cygwin, and on Linux it doesn't display correctly either. – Sep 18 '13 at 07:30

score 1 · Accepted Answer · edited May 23 '17 at 10:29

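You are probably not stripping it correctly: the Base64.decode(" prefix is still in your string after the strip. b64decode silently discards the characters that are not in the Base64 alphabet and decodes the rest, which produces the gibberish in front of the real value. You can see that in the following example: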

>>> print base64.b64decode('Base64.decode("MzEuMTMuMTcuMjE0')
��^r�^31.13.17.214
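If you would rather get an error than silently decoded garbage, one option is strict validation. The following is a minimal sketch, assuming Python 3.3 or later (the code in the question is Python 2, where `validate` is not available); `strict_b64decode` is a hypothetical helper name, not part of the base64 module.

```python
import base64
import binascii


def strict_b64decode(text):
    """Return the decoded bytes, or None if text is not clean Base64."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of discarding them before decoding.
        return base64.b64decode(text, validate=True)
    except binascii.Error:
        return None


# The leftover prefix contains '.', '(' and '"', so strict decoding fails.
bad = strict_b64decode('Base64.decode("MzEuMTMuMTcuMjE0')
# The correctly stripped payload decodes normally.
good = strict_b64decode('MzEuMTMuMTcuMjE0')
print(bad, good)
```

With this, an incompletely stripped string shows up as `None` rather than as stray bytes in the output.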

If you have a pattern similar to this:

>>> pattern = re.compile(r'Base64\.decode\("(...)"\)')

(See SO question: RegEx to parse or validate Base64 data)

group() will return the fully matched string:

>>> pattern.search(s).group()
'Base64.decode("MzEuMTMuMTcuMjE0")'

What you need is the first captured group, which contains only the Base64 payload:

>>> pattern.search(s).groups()[0]
'MzEuMTMuMTcuMjE0'
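Putting the pieces together, a sketch of the whole extraction could look like this. The character class in `PATTERN` is only an assumption standing in for whatever Base64 expression you choose, and `decode_match` and `html` are hypothetical names used for illustration; the sample markup is the line from the question.

```python
import base64
import re

# Assumed pattern: one capturing group holding only the Base64 payload.
PATTERN = re.compile(r'Base64\.decode\("([A-Za-z0-9+/]+={0,2})"\)')


def decode_match(match):
    # group(1) is the first captured group, not the whole matched text.
    payload = match.group(1)
    # b64decode yields raw bytes; decode them as UTF-8 and replace
    # any invalid byte sequences instead of raising.
    return base64.b64decode(payload).decode('utf-8', 'replace')


html = ('<td style="text-align:left; font-weight:bold;">'
        '<script type="text/javascript">'
        'document.write(Base64.decode("MzEuMTMuMTcuMjE0"))</script></td>')

decoded = [decode_match(m) for m in PATTERN.finditer(html)]
print(decoded)
```

Passing `'replace'` (or `'ignore'`) to the final `decode` only matters when the decoded bytes are not valid UTF-8; it does not repair a payload that was extracted incorrectly.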

answered Sep 18 '13 at 08:06

Viktor Kerkez


I am stripping it correctly. I'm passing only the base64-encoded string to the function, which should return the decoded string. – Sep 18 '13 at 08:08
@Daapii can you add a `print curMatch` to your function to check? – Viktor Kerkez Sep 18 '13 at 08:09
Very funny, justhalf. @ViktorKerkez I've found the problem: I had commented out the stripping part while testing and forgot to remove the hash sign. Thanks. – Sep 18 '13 at 08:15
@Daapii Updated the answer and added a possible solution to the problem. – Viktor Kerkez Sep 18 '13 at 08:20
@ViktorKerkez your regex could be simplified to just \w instead of \w\d (\w is alphanumeric). Thanks for your effort :) – Sep 18 '13 at 08:53
@Daapii Actually it's completely incorrect :D Updated the answer. – Viktor Kerkez Sep 18 '13 at 09:00
