I'm scraping a webpage's HTML and am trying to build a regex to grab the information I need. Each block I want starts with "tivo" (the text is always either "Ativo" or "Inativo") and ends with "Ver Detalhes". This pattern repeats about 20 times in my example.

The line of code I'm using on this is:

posts=re.findall('(ativo.*?ver det)',text,re.IGNORECASE)

But it doesn't work: it returns only 12 matches, and I don't understand why. I've tried using .* instead of .*?, but then it returns only 3 matches.

The file can be found at the following link: Source file

Is it possible to extract this?

Wiktor Stribiżew
  • Your file doesn't work (404), your pattern starts with "ativo" (though you're talking about "tivo"), and finally `.` will not match *newlines* by default, so this pattern would only work when all the bits are on the same line (unless you opt into `re.DOTALL`). – Masklinn Feb 18 '20 at 10:31
  • Would you mind including sample data in your question? I would advise against using external links. – JvdV Feb 18 '20 at 10:31
  • Does the text you want to match contain a newline? If so, add the flag re.DOTALL – Michael Butscher Feb 18 '20 at 10:32
  • Use `posts=re.findall('(?is)ativo.*?ver det',text)`. The whole explanation is at [How do I match any character across multiple lines in a regular expression?](https://stackoverflow.com/a/45981809/3832970). – Wiktor Stribiżew Feb 18 '20 at 10:35
  • @WiktorStribiżew It appears that someone has reopened this question. You may still vote to close, but it would also require two other votes now. – Tim Biegeleisen Feb 18 '20 at 10:38
  • @TimBiegeleisen I would like to ask you to stop reopening evident dupes. Please consider re-closing. – Wiktor Stribiżew Feb 18 '20 at 10:39
  • @TimBiegeleisen You may use **[Python regex, matching pattern over multiple lines.. why isn't this working?](https://stackoverflow.com/questions/3534507/python-regex-matching-pattern-over-multiple-lines-why-isnt-this-working)** as a close reason. – Wiktor Stribiżew Feb 18 '20 at 10:43
  • Why did you accept the answer with a non-working solution? You do not have to accept an answer just because it is the only one. Your question is a duplicate. It should be closed, and the answer below removed as the answerer does not want/can't fix the code. – Wiktor Stribiżew Feb 20 '20 at 08:46
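
The two spellings suggested in the comments, the inline `(?is)` prefix and the `flags=re.IGNORECASE|re.DOTALL` argument, should behave identically. The sketch below checks that on a small made-up sample; the `text` value is hypothetical and is not the original source file, which is unavailable.

```python
import re

# Hypothetical sample: two records, the second one split across lines.
text = (
    "<td>Ativo</td><td>Item A</td><a>Ver Detalhes</a>\n"
    "<td>Inativo</td>\n"
    "<td>Item B</td>\n"
    "<a>Ver Detalhes</a>\n"
)

pattern = "ativo.*?ver det"

# Inline flags: i = IGNORECASE, s = DOTALL.
inline = re.findall("(?is)" + pattern, text)
# The same flags passed explicitly.
explicit = re.findall(pattern, text, flags=re.IGNORECASE | re.DOTALL)
# Without DOTALL, the record that spans several lines is missed.
without_dotall = re.findall(pattern, text, flags=re.IGNORECASE)

print(len(inline), len(explicit), len(without_dotall))
```

On this sample the first two calls find both records, while the call without `re.DOTALL` finds only the record that sits on a single line.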

1 Answer

Perhaps some of your desired matches span more than one line, in which case the .* in your pattern would not match across the line breaks, because the dot does not match a newline by default. A solution would be to do the search with "dot all" mode enabled, e.g.

posts = re.findall(r'\b(?:in)?ativo.*?ver detalhes\b', text, flags=re.IGNORECASE|re.DOTALL)

I wrote this answer taking literally what you said in your question:

and should end in "Ver Detalhes"

If you really expect the match to end in just ver det, then use:

posts = re.findall(r'\b(?:in)?ativo.*?ver det', text, flags=re.IGNORECASE|re.DOTALL)
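
A note on `\b` in these patterns: in a plain (non-raw) Python string literal, `\b` is the backspace character, so the regex engine only sees a word boundary when the pattern is written as a raw string (`r'...'`). The sketch below illustrates the difference; the `text` value is a made-up sample, not the OP's data.

```python
import re

# Hypothetical one-record sample.
text = "Inativo ... Ver Detalhes"

plain = '\b(?:in)?ativo.*?ver det'   # '\b' is the backspace character (\x08)
raw = r'\b(?:in)?ativo.*?ver det'    # r'\b' reaches the regex engine as a word boundary

flags = re.IGNORECASE | re.DOTALL

plain_matches = re.findall(plain, text, flags=flags)
raw_matches = re.findall(raw, text, flags=flags)

print(plain[0] == '\x08')  # True: the pattern starts with a literal backspace
print(plain_matches)       # nothing matches, since the text contains no backspace
print(raw_matches)
```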
Tim Biegeleisen
  • The pattern is wrong, it won't match anything in the OP text. Besides, you do not know if there are any line breaks or not between the parts of the text. If the only problem is `re.S`, there is no need to re-answer an old problem. – Wiktor Stribiżew Feb 18 '20 at 10:32
  • This way it doesn't seem to return anything at all – Gustavo Pacheco Feb 18 '20 at 10:33
  • @GustavoPacheco Sure. Try with your pattern but use `flags=re.IGNORECASE|re.DOTALL`. Or just `posts=re.findall('(?is)ativo.*?ver det',text)` – Wiktor Stribiżew Feb 18 '20 at 10:34
  • You're a genius, I love you. 5 hours of trying everything, and it seems that was the reason. Could you explain to me why this happens? – Gustavo Pacheco Feb 18 '20 at 10:36
  • @GustavoPacheco I have done it. See [this post of mine](https://stackoverflow.com/a/45981809/3832970). – Wiktor Stribiżew Feb 18 '20 at 10:36
  • @GustavoPacheco because as noted in the official documentation the dot character "matches any character **except a newline**". So if the start of your pattern and the end are on different lines of the source file, `.*` will fail as it won't be able to step over the line break. `re.DOTALL` removes that restriction. – Masklinn Feb 18 '20 at 10:37
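
To make the explanation above concrete, here is a minimal sketch of how the dot's newline restriction interacts with lazy (`.*?`) and greedy (`.*`) quantifiers. The `text` value is hypothetical (three records, two sharing a line and one split across lines), so the counts only mirror the kind of shortfall described in the question, not its exact numbers.

```python
import re

# Hypothetical sample: records A and B share a line, record C is split over two lines.
text = (
    "Ativo A Ver Detalhes Inativo B Ver Detalhes\n"
    "Ativo C\n"
    "Ver Detalhes\n"
)

def count(pattern, flags):
    """Return how many non-overlapping matches the pattern finds in text."""
    return len(re.findall(pattern, text, flags=flags))

# Lazy, no DOTALL: finds A and B, misses C because '.' stops at the newline.
lazy_no_dotall = count("ativo.*?ver det", re.IGNORECASE)
# Greedy, no DOTALL: A and B are swallowed into one match per line, C is still missed.
greedy_no_dotall = count("ativo.*ver det", re.IGNORECASE)
# Lazy with DOTALL: all three records are found.
lazy_dotall = count("ativo.*?ver det", re.IGNORECASE | re.DOTALL)
# Greedy with DOTALL: one match from the first "Ativo" to the last "Ver Det".
greedy_dotall = count("ativo.*ver det", re.IGNORECASE | re.DOTALL)

print(lazy_no_dotall, greedy_no_dotall, lazy_dotall, greedy_dotall)
```

This is consistent with the question's observation that switching from `.*?` to `.*` lowers the match count: the greedy form merges records that sit on the same line, while neither form can cross a line break without `re.DOTALL`.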