Questions tagged [html-content-extraction]

Techniques for predicting or detecting article text and extracting it from a particular document. Also referred to as web scraping (web harvesting or web data extraction), this is a computer software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web either by implementing low-level Hypertext Transfer Protocol (HTTP) or by embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.
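
As a rough illustration of the extraction half of that description, here is a minimal sketch that pulls the visible text out of an HTML string using only Python's standard-library `html.parser`. The names `VisibleTextExtractor` and `extract_text`, and the choice of which elements to skip, are assumptions made for this example; it does not fetch pages over HTTP and does not attempt to detect which text is the article.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of non-visible elements."""

    # Assumption for this sketch: text inside these elements is never shown
    # in the page body, so it is dropped.
    SKIPPED = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        # One text node per line; whitespace handling is deliberately crude.
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text(page_source)` on a downloaded page returns its text nodes joined by newlines. For messy real-world markup, a dedicated library such as Beautiful Soup is usually the more robust choice.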

210 questions
409
votes
40 answers

Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement, I'm actually interested in hearing about other languages as well. The story so…
Mark Harrison
  • 267,774
  • 112
  • 308
  • 434
266
votes
32 answers

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more robust than using regular expressions that may…
John D. Cook
  • 28,187
  • 10
  • 63
  • 93
168
votes
11 answers

Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this: title = re.search('<title>.*</title>', html, re.IGNORECASE).group() if title: title = title.replace('<title>', '').replace('</title>', '') Is there a…
hoju
  • 24,959
  • 33
  • 122
  • 169
157
votes
10 answers

How to extract img src, title and alt from html using php?

I would like to create a page that lists all the images on my website along with their title and alternative text. I already wrote a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and…
Sam
  • 26,538
  • 45
  • 157
  • 240
138
votes
10 answers

BeautifulSoup Grab Visible Webpage Text

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the…
user233864
  • 1,597
  • 2
  • 12
  • 12
71
votes
3 answers

Using BeautifulSoup to find a HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2>this is cool #12345678901</h2>

So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be…
sotangochips
  • 2,580
  • 4
  • 27
  • 38
69
votes
9 answers

parsing HTML on the iPhone

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate. Does such a library exist, or am I better off just trying to use regular expressions?
Sophie Alpert
  • 126,406
  • 35
  • 212
  • 233
66
votes
15 answers

What is the best way to parse html in C#?

I'm looking for a library/method to parse an HTML file with more HTML-specific features than generic XML parsing libraries.
Luke
  • 17,750
  • 24
  • 81
  • 108
32
votes
11 answers

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page? For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get…
bluebit
  • 2,909
  • 7
  • 31
  • 41
22
votes
2 answers

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back the clean text of the post itself. My basic approach (from Python) has been to use a combination of…
nartz
20
votes
9 answers

How do screen scrapers work?

I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
19
votes
5 answers

How do you parse an HTML in vb.net

I would like to know if there is a simple way to parse HTML in vb.net. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way…
tooleb
  • 612
  • 3
  • 6
  • 14
19
votes
8 answers

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this? I've tried saving the string as an .xml…
MattSayar
  • 2,008
  • 6
  • 23
  • 28
19
votes
11 answers

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that?
Ron Harlev
  • 15,010
  • 24
  • 83
  • 128
19
votes
8 answers

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code…
MajorMajor