Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically, for example to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
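To illustrate the idea of turning markup into a tree you can query, here is a minimal sketch using Python's standard-library `html.parser` module. The `Node`, `TreeBuilder`, and `parse` names are hypothetical helpers written for this example. Note that `html.parser` is a lenient tokenizer and does not implement the standard parsing algorithm, so the resulting tree may differ from what a browser builds for malformed input.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical, minimal stand-in for a DOM element."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects and text strings, in order

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)

    def text(self):
        """Concatenate all descendant text in document order."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the start-tag, end-tag, and text callbacks."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


doc = parse('<p>Hello <b>world</b><img src="a.png" alt="A"><a href="/x">link</a></p>')
for link in doc.find_all("a"):
    print(link.attrs["href"], link.text())
```

For real-world pages, a parser that follows the specification (or a mature library in your language of choice) is usually the safer option; this sketch only shows the shape of the tree-building step.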

5674 questions
2193 votes, 30 answers

How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?
RobertPitt
409 votes, 40 answers

Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement, I'm actually interested in hearing about other languages as well. The story so…
Mark Harrison
317 votes, 11 answers

Parse an HTML string with JS

I searched for a solution but nothing was relevant, so here is my problem: I want to parse a string which contains HTML text. I want to do it in JavaScript. I tried this library but it seems that it parses the HTML of my current page, not from a…
stage
232 votes, 4 answers

How to strip HTML tags from string in JavaScript?

How can I strip the HTML from a string in JavaScript?
f.ardelian
217 votes, 18 answers

Using regular expressions to parse HTML: why not?

It seems like every question on stackoverflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML. Why not? I'm aware that there are quote-unquote "real" HTML…
ntownsend
203 votes, 7 answers

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects. If I have a document of the form: Heading
ffledgling
197 votes, 3 answers

Which HTML Parser is the best?

I code a lot of parsers. Up until now, I was using HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both the tasks. As 80% of my work involves just parsing, I want to use a light HTML parser because it takes much…
Yatendra
161 votes, 19 answers

Regex select all text between tags

What is the best way to select all the text between 2 tags - ex: the text between all the 'pre' tags on the page.
basheps
157 votes, 10 answers

How to extract img src, title and alt from html using php?

I would like to create a page where all images which reside on my website are listed with title and alternative representation. I already wrote me a little program to find and load all HTML files, but now I am stuck at how to extract src, title and…
Sam
139 votes, 0 answers

Robust and Mature HTML Parser for PHP

Are there any robust and mature HTML parsers available for PHP? A quick skimming of PEAR didn't turn anything up (lots of classes for generating HTML, not so much for consuming), and Google taught me a lot of people have started and then abandoned a…
Alan Storm
96 votes, 6 answers

How do I parse a HTML page with Node.js

I need to parse (server side) big amounts of HTML pages. We all agree that regexp is not the way to go here. It seems to me that javascript is the native way of parsing a HTML page, but that assumption relies on the server side code having all the…
Itay Moav -Malimovka
96 votes, 5 answers

How do HTML parsers work if they're not using regexp?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing…
Andy E
94 votes, 8 answers

How to extract string following a pattern with grep, regex or perl

I have a file that looks something like this:
wrangler
84 votes, 8 answers

How to normalize HTML in JavaScript or jQuery?

Tags can have multiple attributes. The order in which attributes appear in the code does not matter. For example: How can I "normalize" the HTML in Javascript, so the order of the attributes is always…
Julien
73 votes, 7 answers

Extracting information from a web page by machine learning

I would like to extract a specific type of information from web pages in Python. Let's say postal address. It has thousands of forms, but still, it is somehow recognizable. As there is a large number of forms, it would be probably very difficult to…
Honza Javorek