
I'm using urllib2 to fetch the source code of a website, which I then filter with a regex for a Base64-encoded string. I iterate over the matches, passing each one to a function:

def Base64Decoder(match):  
    curMatch = match.group().decode('utf-8', errors='ignore')  
    decoded = base64.b64decode(curMatch)   
    return decoded

When I print out the returned value of Base64Decoder some chars are wrong, how do I filter them out correctly? I don't want to see gibberish chars like the following:

Cygwin linux

The website's encoding is utf-8 but the returned value of urllib seems to be unicode?

Edit: the source code looks like this (raw):

<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode("MzEuMTMuMTcuMjE0"))</script></td>

and the filtered string is Base64.decode("MzEuMTMuMTcuMjE0, stripped to MzEuMTMuMTcuMjE0

1 Answer

You are probably not stripping it correctly: the Base64.decode(" prefix is still in your string after the strip. You can see the effect in the next example:

>>> print base64.b64decode('Base64.decode("MzEuMTMuMTcuMjE0')
��^r�^31.13.17.214

If you have a pattern similar to this:

>>> pattern = re.compile('Base64.decode\("(...)"\)')

(See SO question: RegEx to parse or validate Base64 data)

group() will return the fully matched string:

>>> pattern.search(s).group()
'Base64.decode("MzEuMTMuMTcuMjE0")'

The thing you need is:

>>> pattern.search(s).groups()[0]
'MzEuMTMuMTcuMjE0'
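
Putting the pieces together, here is a minimal sketch of extracting and decoding every embedded value in one pass. The names PATTERN and decode_all are hypothetical, and the character class is an assumption that the page uses the standard Base64 alphabet with optional `=` padding; adjust it if your data differs.

```python
import base64
import re

# Assumption: the payload uses the standard Base64 alphabet (A-Z, a-z, 0-9, +, /)
# with up to two '=' padding characters. The capture group holds only the payload.
PATTERN = re.compile(r'Base64\.decode\("([A-Za-z0-9+/]+={0,2})"\)')


def decode_all(html):
    """Return the decoded text of every Base64.decode("...") call found in html."""
    # group(1) is the captured payload; group() would include the
    # Base64.decode(" prefix and produce garbage when decoded.
    return [base64.b64decode(m.group(1)).decode('utf-8')
            for m in PATTERN.finditer(html)]


html = ('<td><script type="text/javascript">'
        'document.write(Base64.decode("MzEuMTMuMTcuMjE0"))</script></td>')
print(decode_all(html))
```

Using finditer with a capture group avoids a separate stripping step, so there is no stripping code to accidentally leave commented out.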
Viktor Kerkez
  • I am stripping it correctly. I'm passing only the Base64-encoded string to the function, which should return the decoded string. –  Sep 18 '13 at 08:08
  • @Daapii can you add a `print curMatch` to your function to check? – Viktor Kerkez Sep 18 '13 at 08:09
  • Very funny justhalf. @ViktorKerkez I've found the problem: I had commented out the stripping part while testing and forgot to remove the hash sign. Thanks. –  Sep 18 '13 at 08:15
  • @Daapii Updated the answer, added a possible solution to the problem. – Viktor Kerkez Sep 18 '13 at 08:20
  • @ViktorKerkez your regex code could be optimized to just \w instead of \w\d (\w is alphanumeric), thanks for your effort :) –  Sep 18 '13 at 08:53
  • @Daapii Actually it's completely incorrect :D Updated the answer. – Viktor Kerkez Sep 18 '13 at 09:00