Parsing HTML doc without using tag or any other selector in Java

Question

I have an HTML page, for example http://www.crisil.com/Ratings/RatingList/RatingDocs/_G_Telecom_Infra_India_Private_Limited_August_28_2015_RR.html

I want to extract the "About the Company" paragraph and the table below it without using any kind of selector or XPath in Java.

I know I can use XPath, but I have many different pages from different domains, and the XPath might change.

The "About the Company" string will be constant, but its position might vary from page to page. Please suggest a solution. I have tried Jsoup, HtmlUnit, DocumentBuilder and some other libraries, but it looks like most of them rely on tags.

Why is the requirement not to use XPath? You could search for something like `About CRISIL LIMITED` — Ahmed Ashour, Nov 10 '15 at 07:41
You could use XPath `contains()` to select by text, [see this](http://stackoverflow.com/questions/1064968/how-to-use-xpath-contains-here) (you will still have to use tags in some fashion - that's how HTML is structured - but this approach may help you avoid classes and other things that can change). — halfer, Nov 10 '15 at 08:03
Because I have n different sources. I am now using a general XPath with Java's `XPathFactory` to get the table, but iterating over it is now a big problem — spondon majumdar, Nov 11 '15 at 11:55

score 0 · Answer 1 · answered Nov 10 '15 at 06:01


You can use BeautifulSoup, a Python library: http://www.crummy.com/software/BeautifulSoup/

However, you should have shown us the code you have tried, so we could help you with your existing code. I could show you some code; in BeautifulSoup it is trivial to look for the next table element after a given piece of text, such as the "About the Company" heading you are reading. Write some code with it, and if it doesn't work for you, we'll help.
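Since the question asks for Java, the same idea (find the text, then take the next table after it) can be sketched with only the JDK's DOM API, with no XPath and no CSS selectors. This is a rough sketch under assumptions, not code from this answer: the class `NextTableAfterText`, the helpers `findAfter` and `nextInDocumentOrder`, and the sample markup are made up for illustration, and it assumes the page is well-formed XHTML, which real pages often are not. It still has to name the `table` tag, because that is how HTML is structured.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NextTableAfterText {

    // Hypothetical, well-formed stand-in for a real page.
    static final String SAMPLE =
            "<html><body>"
            + "<div><b>About the Company</b></div>"
            + "<div><p>Incorporated in 2009.</p></div>"
            + "<table><tr><td>Revenue</td><td>100</td></tr></table>"
            + "</body></html>";

    // Parses well-formed markup; real-world HTML may need tidying first.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Returns the node that follows 'node' in document order, or null at the end.
    static Node nextInDocumentOrder(Node node) {
        if (node.getFirstChild() != null) {
            return node.getFirstChild();
        }
        while (node != null) {
            if (node.getNextSibling() != null) {
                return node.getNextSibling();
            }
            node = node.getParentNode();
        }
        return null;
    }

    // Finds the first <tagName> element that appears after a text node containing 'marker'.
    // Limitation: the marker must sit inside a single text node.
    static Node findAfter(Document doc, String marker, String tagName) {
        boolean markerSeen = false;
        for (Node n = doc.getDocumentElement(); n != null; n = nextInDocumentOrder(n)) {
            if (!markerSeen) {
                markerSeen = n.getNodeType() == Node.TEXT_NODE
                        && n.getNodeValue().contains(marker);
            } else if (n.getNodeType() == Node.ELEMENT_NODE
                    && n.getNodeName().equalsIgnoreCase(tagName)) {
                return n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node paragraph = findAfter(doc, "About the Company", "p");
        Node table = findAfter(doc, "About the Company", "table");
        System.out.println(paragraph == null ? "no paragraph" : paragraph.getTextContent());
        System.out.println(table == null ? "no table" : table.getTextContent());
    }
}
```

Because the walk depends only on the heading text and on document order, it does not care which `div`, class or id wraps the section on a given site.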

Brij Raj Singh - MSFT


The solution should be in Java. – spondon majumdar Nov 11 '15 at 11:49

score 0 · Answer 2 · edited May 23 '17 at 11:59


XPath does have the ability to select elements by innertext.

Check here: XPath selection by innertext
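As an illustration of selecting by text rather than by position, here is a minimal sketch using the JDK's built-in `javax.xml.xpath`. The class `AboutCompanyXPath`, the helper `firstFollowing` and the sample markup are assumptions made up for this example, and the input is assumed to be well-formed XHTML. The expression anchors on any text node containing the heading and then takes the first matching element on the `following` axis, so it does not depend on ids, classes or nesting depth.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AboutCompanyXPath {

    // Hypothetical, well-formed stand-in for a real page.
    static final String SAMPLE =
            "<html><body>"
            + "<div><b>About the Company</b></div>"
            + "<div><p>Incorporated in 2009.</p></div>"
            + "<table><tr><td>Revenue</td><td>100</td></tr></table>"
            + "</body></html>";

    // Parses well-formed markup; real-world HTML may need tidying first.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Returns the first <tagName> element after any text node containing 'marker', or null.
    static Node firstFollowing(Document doc, String marker, String tagName) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Pass the marker as an XPath variable so quotes in it cannot break the expression.
        xpath.setXPathVariableResolver(
                name -> "marker".equals(name.getLocalPart()) ? marker : null);
        String expr = "(//text()[contains(., $marker)]/following::" + tagName + ")[1]";
        try {
            return (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node paragraph = firstFollowing(doc, "About the Company", "p");
        Node table = firstFollowing(doc, "About the Company", "table");
        System.out.println(paragraph == null ? "no paragraph" : paragraph.getTextContent());
        System.out.println(table == null ? "no table" : table.getTextContent());
    }
}
```

The same expression should work unchanged across sites as long as the heading text is constant and the paragraph and table come after it in the page.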

answered Nov 10 '15 at 22:22

N K


score 0 · Answer 3 · answered Nov 11 '15 at 14:55

I would use HtmlUnit and then look up the element with id="AboutCompanySecDivEdit":

page.getElementById("AboutCompanySecDivEdit");

which will return:

<div style="TEXT-ALIGN: justify; WIDTH: 100%; FONT-FAMILY: verdana, 'ms sans serif', arial; FONT-SIZE: 12px" id="AboutCompanySecDivEdit" jquery171011939482107256965="3">
    <p>
        <span style="FONT-FAMILY: verdana, 'ms sans serif', arial; FONT-SIZE: 12px">Incorporated in 2009, Hyderabad-based 3GTI, is an infrastructure provider of fiber optic in Andhra Pradesh. 3GTI owns a robust fiber network across Andhra Pradesh. 3GT) offers solutions for Enterprise Businesses
            &amp; service Providers. The company is promoted by Mrs.Yarla Geetha, Mrs. M Ratna Kumari &amp; Mrs. Nusrat Moinuddin.</span>
    </p>
</div>

This will only work if all your websites have this id set, as in the example you gave.
