Python - Why isn't this specific text being found by findall regex?

Question

EDIT: PLEASE DO NOT DOWNVOTE WITHOUT COMMENTING ON WHY YOU ARE DOWNVOTING. I AM TRYING MY BEST TO WRITE THIS PROPERLY!

I am trying to print the URLs of all the watches on a website. All of them print fine except one, even though that one meets the same regex conditions as the others. Can someone explain why it isn't printing? Have I messed up the syntax somewhere? The following code can be pasted into a Python editor (e.g. IDLE) and run.

## Import required modules
from urllib import urlopen
from re import findall
import re

## Provide URL
dennisov_url = 'https://denissov.ru/en/'

## Open and read URL as string named 'dennisov_html'
dennisov_html = urlopen(dennisov_url).read()

## Find all of the links opened when each watch is clicked (those with the
## designated preceding text 'window.open', then any character that occurs zero
## or more times, then the text '/en/'). Remove matches with the word "History"
## and any " symbols in the URL.
watch_link_urls = findall('window.open.*(/en/[^history][^"]*/)', dennisov_html)
## For every URL, convert it into a string on a new line and add the domain
for link in watch_link_urls:
    link = 'https://denissov.ru' + link
## Print out the full URLs
    print link

## This code should show the link https://denissov.ru/en/speedster/ yet
## it isn't showing. It has the exact same preceding text as the other links
## that are printing and is in the same div container. If you inspect the
## website and search for 'en/barracuda_mechanical/' and then 'en/speedster/',
## you will see that the speedster link is only a few lines below barracuda
## mechanical and there is nothing different about the preceding text of the
## two, so speedster should be printing.

Oh yes, the `[^history][^"]*` part is messed up. It means any one character other than h, i, s, t, o, r, y, followed by any characters other than `"`. - Wiktor Stribiżew, May 20 '17 at 07:37
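
To make the comment concrete, here is a minimal sketch run against a made-up fragment (the `sample_html` string is an assumption for illustration, not the real page source). `[^history]` is a character class, so it consumes exactly one character that is not h, i, s, t, o, r or y. Because `speedster` begins with `s`, the match fails immediately after `/en/`, while names such as `barracuda_mechanical` pass.

```python
import re

# Hypothetical markup, one link per line, loosely modelled on the question.
sample_html = '''<div onclick="window.open('/en/barracuda_mechanical/')">
<div onclick="window.open('/en/speedster/')">
<div onclick="window.open('/en/history/')">
'''

# The pattern from the question, unchanged apart from the raw-string prefix.
pattern = r'window.open.*(/en/[^history][^"]*/)'
matches = re.findall(pattern, sample_html)
print(matches)

# Check only the first character of each name against the character class.
names = ("barracuda_mechanical", "speedster", "free_rider", "history")
first_char_ok = {name: re.match(r"[^history]", name) is not None for name in names}
for name in names:
    print(name, first_char_ok[name])
```

Any watch name whose first letter happens to be one of those seven letters would be dropped in the same way, which is why excluding a whole word needs a lookahead or a plain substring check rather than a character class.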

Chiheb Nexus · Answer 1 · 2017-05-20T06:09:10.867

You can try this code with this pattern:

from urllib2 import urlopen
import re

url = 'https://denissov.ru/en/'
data = urlopen(url).read()
# Capture the path passed to every window.open('...') call
sub_urls = re.findall(r"window\.open\('(/.*?)'", data)
# Keep every match, including duplicates:
# final_urls = [k for k in sub_urls if '/history' not in k and k != '']
# Or remove duplicates:
final_urls = set(k for k in sub_urls if '/history' not in k)

for k in final_urls:
    link = 'https://denissov.ru' + k
    print link

This will output something like the following:

https://denissov.ru/eng/denissovdesign/index.html
https://denissov.ru/en/barracuda_limited/
https://denissov.ru/en/barracuda_chronograph/
https://denissov.ru/en/barracuda_mechanical/
https://denissov.ru/en/speedster/
https://denissov.ru/en/free_rider/
https://denissov.ru/en/nau_automatic/
https://denissov.ru/en/lady_flower/
https://denissov.ru/en/enigma/
https://denissov.ru/en/number_one/
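
For readers on Python 3, the same idea can be sketched without fetching the page. The `sample_html` fragment below is invented for illustration and the real markup may differ. This variant also swaps the set for `dict.fromkeys`, an assumption on my part rather than part of the answer, so that duplicates are dropped while the links stay in page order.

```python
import re

# Hypothetical page fragment; the real site's markup may differ.
sample_html = """
<a onclick="window.open('/en/barracuda_limited/')">
<a onclick="window.open('/en/history/')">
<a onclick="window.open('/en/speedster/')">
<a onclick="window.open('/en/barracuda_limited/')">
"""

# Capture the path passed to every window.open('...') call.
sub_urls = re.findall(r"window\.open\('(/.*?)'", sample_html)

# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed from Python 3.7), so duplicates vanish without reordering.
unique_urls = list(dict.fromkeys(u for u in sub_urls if "/history" not in u))

full_urls = ["https://denissov.ru" + u for u in unique_urls]
for link in full_urls:
    print(link)
```

Filtering with a substring test after the match keeps the regex simple, at the cost of one extra pass over the results.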

score 0 · Answer 2 · edited May 23 '17 at 12:34

If you want a regex that matches all URLs that start with /en/ and don't contain the word history, you should use a tempered greedy token, like this:

en\/(?:(?!history).)*?\/

(?:(?!history).)*? is a tempered dot: it matches any character, one at a time, as long as the text history does not start at that position.
- (?!history) is the negative lookahead that enforces this.
- The ?: marks the group as non-capturing.
- The *? makes the match non-greedy, so it extends only up to the first /.
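
The behaviour described above can be checked in isolation with a short sketch. The `paths` list is a hypothetical stand-in for the links on the page, and `re.fullmatch` is used here only to test each path as a whole.

```python
import re

# The tempered pattern: after /en/, consume characters lazily, refusing to
# step onto any position where the text "history" begins.
tempered = re.compile(r"/en/(?:(?!history).)*?/")

# Hypothetical paths standing in for the links found on the page.
paths = ["/en/speedster/", "/en/history/", "/en/barracuda_mechanical/"]

kept = [p for p in paths if tempered.fullmatch(p)]
print(kept)
```

Unlike the character class `[^history]`, the lookahead tests for the whole word, so `/en/speedster/` survives even though it starts with `s`.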

Regex101 Demo

Change the Python code like this:

watch_link_urls = findall(r'window.open.*(/en\/(?:(?!history).)*?\/)', dennisov_html)

Output:

https://denissov.ru/en/barracuda_limited/
https://denissov.ru/en/barracuda_chronograph/
https://denissov.ru/en/barracuda_mechanical/
https://denissov.ru/en/speedster/
https://denissov.ru/en/free_rider/
https://denissov.ru/en/nau_automatic/
https://denissov.ru/en/lady_flower/
https://denissov.ru/en/enigma/
https://denissov.ru/en/number_one/
