I am trying to write a JavaScript program that grabs the inner HTML of the top news story on the BBC website (http://www.bbc.co.uk/news) and puts it in a .txt document. I don't know much about JavaScript; I know more about .BAT and .VBS, but I know that they can't do this.

I'm not sure how to approach this. I thought of making it scan for a fixed piece of outer HTML and then copy the inner HTML to a txt file.

However, I can't seem to find any outer HTML that stays the same from day to day. For example, this is today's title:

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>

As you can see, the headline is part of it.

I'm using Firefox, if that makes a difference.

Any help would be much appreciated.

Regards,

Master-chip.

Master-chip

3 Answers

1

Pure client-side browser approach:

OK, I made this fiddle for you; it may help others too. I found the problem interesting and challenging. Here is how I arrived at a possible solution:

  • Used the Blob API (part of the File API) to create the text file on the fly.
  • Loaded http://www.bbc.co.uk/news in an iframe (this needs cross-origin access; see the Note section below).
  • On the iframe's load event, triggered a timeout using setTimeout, or setInterval (commented out) for repeated execution hourly or daily. Adjust the delay as needed.
  • Queried the headline elements using document.querySelectorAll(".title-link span"), which seemed generic enough after examining the page source.
  • Check out the fiddle link.

Javascript:

 (function () {
    var textFile = null,
        makeTextFile = function (text) {
            var data = new Blob([text], {
                type: 'text/plain'
            });

            // If we are replacing a previously generated file we need to
            // manually revoke the object URL to avoid memory leaks.
            if (textFile !== null) {
                window.URL.revokeObjectURL(textFile);
            }

            textFile = window.URL.createObjectURL(data);

            return textFile;
        };

    var iframe = document.getElementById('frame');    
    var commFunc = function () {
            var iframe2 = document.getElementById('frame'); //This is required to get the fresh updated DOM
            var innerDoc = iframe2.contentDocument || iframe2.contentWindow.document;            
            var getAll = Array.prototype.slice.call(innerDoc.querySelectorAll(".title-link span"));          
            var dummy = "";
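            getAll.forEach(function (node) {
                // textContent is more widely supported than innerText
                dummy = dummy.concat("\n" + node.textContent);
            });
            var link = document.createElement("a");
            link.href = makeTextFile(dummy);
            link.download = "sample.txt";
            // Some browsers only honor click() on links attached to the document
            document.body.appendChild(link);
            link.click();
            document.body.removeChild(link);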
            console.log("Downloaded the sample.txt file");
        };

    iframe.onload = function () {
        setTimeout(commFunc, 1000); //Adjust the time required to load
        //setInterval(commFunc, 1000);
    };  

    //The button triggers the same download; click it once the page inside the iframe has loaded
    document.getElementById('create').addEventListener('click', commFunc);
})();

HTML:

    <div>
        <iframe id="frame" src="http://www.bbc.co.uk/news"></iframe>
    </div>
    <button id="create">Download</button>

Note:

  • Reading the DOM of a cross-origin iframe is normally blocked by the browser's same-origin policy, so to run the above JavaScript in Chrome you need to disable web security. The script should run in Firefox without tweaks.
  • This is only an illustration of what can be achieved with pure browser scripting. The tab must stay active for periodic grabbing.
  • Targeted at modern browsers.
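The web-security caveat in the note comes down to whether the hosting page and the framed page share an origin (scheme, host and port). A minimal sketch of that comparison follows; sameOrigin is a hypothetical helper and the example.com URL is only a placeholder for wherever the script is hosted.

```javascript
// Two URLs share an origin only if scheme, host and port all match.
function sameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Placeholder for the page hosting the script (an assumption, not from the answer).
var hostPage = "https://example.com/scraper.html";
var framedPage = "http://www.bbc.co.uk/news";

console.log(sameOrigin(hostPage, framedPage));                     // false
console.log(sameOrigin(framedPage, "http://www.bbc.co.uk/sport")); // true
```

When the comparison is false, a script on the hosting page is normally not allowed to read the iframe's document.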

Suggested Approach:

  • Use a node.js server; you can modify the above script to run standalone.

  • Or use any server-side scripting framework, such as PHP or Java Spring.


Node.js approach:


Javascript:

var jsdom = require("node-jsdom");
var fs = require("fs");
jsdom.env({
  url: "http://www.bbc.co.uk/news",
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    var $ = window.$;
    console.log("BBC headlines");
    $(".title-link span").each(function() {
      //console.log(" -", $(this).text());
      // appendFileSync creates sample.txt if it does not exist yet
      fs.appendFileSync("sample.txt", $(this).text() + "\n");
    });
  }
});

Dependencies for the above code: the node-jsdom package (fs is built into node.js).

Hope this helps you and others too.

Nirus
  • WOW! This is great! Thank you. Just one small problem: when I paste the JavaScript code into a .js file, it says that 'document' is undefined (Line 18, Char 5. Error Code: 800A1391). At the risk of sounding silly, how can I fix that? – Master-chip Aug 11 '15 at 13:31
  • Can you post a fiddle link for what you are trying to do? Did you take a look at my fiddle link solution? – Nirus Aug 11 '15 at 13:51
  • Yeah, I saw your fiddle link. I tried to copy your JavaScript code into Notepad and run it with a .js extension so that it would be independent and offline. That's when I got the error code. Then I reread your answer and saw that you suggested installing node.js, which I did. Following that, I ran your .js script again with it, and it stopped when it reached document, saying "ReferenceError: document is not defined." – Master-chip Aug 11 '15 at 14:17
  • Aah, OK. The above code works in the browser only. As explained in **suggested approach**, you can't run the above snippet directly on node.js, because the document object is not available in node.js (hope you are not a newbie). Hence you need to customize the code so that it will run on node.js: for example, use the filesystem in place of the Blob API, use https://github.com/tmpvar/jsdom for the document model, and more. – Nirus Aug 11 '15 at 14:30
  • @Master-chip Based on your request, I have updated the answer. – Nirus Aug 11 '15 at 15:03
  • Actually this is my first post and I've never programmed in javascript before. I don't see myself as a newbie to programming though. Thanks for all of your advice, this should come to life pretty soon. – Master-chip Aug 11 '15 at 15:34
0

My thoughts -

  1. JS can be used to get data/text from pages, but to save it into a file you have to use something on the back end, such as Python or PHP.

  2. Why use JS? You can scrape the web very well using cURL. Use PHP cURL if that's easier for you.

You can scrape/download the webpage using -

<?php
    // Defining the basic cURL function
    function curl($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>

Then use the function as needed:

<?php
    $scraped_website = curl("http://www.yahoo.com");  // Executing our curl function to scrape the webpage http://www.yahoo.com and return the results into the $scraped_website variable
?>

Reference Links-

Web scraping with PHP and CURL

Scraping in PHP with CURL

You can scrape more precisely by targeting the divs and nodes of the HTML elements. Check these out - Part1 - Part2 - Part3
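Once the page source is available as a string, the headline can be picked out by its class name, which stays fixed while the text changes. Here is a rough sketch in JavaScript, to match the question. extractHeadlines is a hypothetical helper, the sample markup is a stand-in for the downloaded page, and the regular expression only handles this flat span markup; it is not a general HTML parser.

```javascript
// Hypothetical helper: return the text of every
// <span class="title-link__title-text"> found in an HTML string.
function extractHeadlines(html) {
  var pattern = /<span class="title-link__title-text">([^<]*)<\/span>/g;
  var headlines = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    headlines.push(match[1].trim());
  }
  return headlines;
}

// Sample markup standing in for the downloaded page (an assumption).
var sampleHtml =
  '<a class="title-link"><span class="title-link__title-text">' +
  "Benefit plan 'could hit young Britons'</span></a>";

console.log(extractHeadlines(sampleHtml)[0]);
```

For anything beyond a quick script, a real HTML parser is the safer choice.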

Hope it helps. Happy Coding!

bozzmob
-1

If I understand correctly, you want to download a txt file with content taken from the HTML? If so, you can use this to create a txt file and download it. If you want to get the text from all the title spans, you need to do this:

var txt = "";
var nodeList = document.querySelectorAll(".title-link__title-text");
for (var i = 0; i < nodeList.length; i++) {
   // textContent is more widely supported than innerText
   txt += "\n" + nodeList[i].textContent;
}

And then write the txt variable to a file, as in the post I mentioned above.
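If the script runs under node.js instead of in the browser, the writing step could look roughly like the sketch below. This is a swapped-in technique (the built-in fs module, not the Blob download); saveHeadlines, the sample headlines and the temp-directory path are all placeholders.

```javascript
var fs = require("fs");
var os = require("os");
var path = require("path");

// Hypothetical helper: join the headlines one per line and write them to a file.
function saveHeadlines(headlines, filePath) {
  fs.writeFileSync(filePath, headlines.join("\n") + "\n", "utf8");
}

// Placeholder data and path; in practice the list would come from the page.
var outFile = path.join(os.tmpdir(), "sample.txt");
saveHeadlines(["First headline", "Second headline"], outFile);
console.log(fs.readFileSync(outFile, "utf8"));
```

writeFileSync overwrites the file on each run; appendFileSync would keep earlier headlines.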

Alex Nikulin