Questions tagged [html-content-extraction]

Techniques for predicting or detecting the article text in a document and extracting it. Also referred to as web scraping (or web harvesting, or web data extraction), a software technique for extracting information from websites. Such programs usually simulate human exploration of the World Wide Web, either by implementing the low-level Hypertext Transfer Protocol (HTTP) or by embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.
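
As a rough illustration of the extraction step described above, here is a minimal sketch that pulls the readable text out of an HTML string using only Python's standard `html.parser` module. The names `TextExtractor` and `extract_text` are hypothetical, chosen for this example; the sketch assumes that skipping `<script>` and `<style>` contents is enough to approximate "visible" text, which will not hold for every page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped elements and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b></p><script>var x = 1;</script>"
    print(extract_text(sample))
```

Fetching the page over HTTP and deciding which part of the text is the actual article are separate problems that this sketch does not attempt.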

210 questions

409

votes

40 answers

Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement, I'm actually interested in hearing about other languages as well. The story so…

asked Aug 05 '08 at 21:09

Mark Harrison

266

votes

32 answers

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more robust than using regular expressions that may…

python html text html-content-extraction

asked Nov 30 '08 at 02:28

John D. Cook

168

votes

11 answers

Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this: title = re.search('<title>.*</title>', html, re.IGNORECASE).group() if title: title = title.replace('<title>', '').replace('</title>', '') Is there a…

python html regex html-content-extraction

asked Aug 25 '09 at 10:24

hoju

157

votes

10 answers

How to extract img src, title and alt from html using php?

I would like to create a page that lists all the images on my website along with their title and alternative text. I have already written a little program to find and load all HTML files, but now I am stuck at how to extract src, title and…

php html regex html-parsing html-content-extraction

asked Sep 26 '08 at 08:33

Sam

138

votes

10 answers

BeautifulSoup Grab Visible Webpage Text

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the…

python text beautifulsoup html-content-extraction

asked Dec 20 '09 at 17:55

user233864

votes

3 answers

Using BeautifulSoup to find a HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2>this is cool #12345678901</h2>

So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be…

python regex beautifulsoup html-content-extraction

asked May 14 '09 at 21:46

sotangochips

votes

9 answers

parsing HTML on the iPhone

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate. Does such a library exist, or am I better off just trying to use regular expressions?

iphone html parsing html-content-extraction

asked Jan 02 '09 at 00:37

Sophie Alpert

votes

15 answers

What is the best way to parse html in C#?

I'm looking for a library/method to parse an HTML file with more HTML-specific features than generic XML parsing libraries.

c# .net html parsing html-content-extraction

asked Sep 11 '08 at 09:16

Luke

votes

11 answers

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page? For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get…

html web-services parsing webpage html-content-extraction

asked Aug 03 '09 at 17:04

bluebit

votes

2 answers

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back the clean text of the post itself. My basic approach (from Python) has been to use a combination of…

html parsing text-parsing html-content-extraction

asked Jul 18 '09 at 07:27

nartz

votes

9 answers

How do screen scrapers work?

I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.

screen-scraping web-scraping html-content-extraction pdf-scraping console-scraping

asked Oct 01 '08 at 03:10

Micah

votes

5 answers

How do you parse an HTML in vb.net

I would like to know if there is a simple way to parse HTML in vb.net. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way…

.net html vb.net parsing html-content-extraction

asked Feb 05 '09 at 16:59

tooleb

votes

8 answers

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this? I've tried saving the string as an .xml…

c# html xml html-content-extraction

asked Nov 18 '08 at 21:46

MattSayar

votes

11 answers

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove: any HTML tags; any JavaScript; any CSS styles. Is there a regular expression (one or more) that will achieve that?

html regex html-content-extraction text-extraction

asked Oct 08 '08 at 01:43

Ron Harlev

votes

8 answers

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code…

java html screen-scraping html-content-extraction text-extraction

asked Sep 06 '09 at 16:52

MajorMajor
