Copy InnerHTML to text file Daily using javascript

Question

I am trying to write a JavaScript program that grabs the inner HTML of the top news story on the BBC website (http://www.bbc.co.uk/news) and puts it in a .txt document. I don't know much about JavaScript; I know more about .BAT and .VBS, but I know that they can't do this.

I'm not sure how to approach this. I thought of making it scan for a fixed outer HTML element and then copy the inner content to a text file.

However, I can't seem to find any outer HTML that stays the same every day. For example, this is the markup for today's title:

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>

As you can see, the headline text is part of it.

I'm using Firefox, if that makes a difference.

Any help would be much appreciated.

Regards,

Master-chip.

Perhaps you'd be better off using their news feed (RSS) http://www.bbc.co.uk/news/10628494 — Lee Taylor, Aug 11 '15 at 10:39
You could always try a selector like `'span[class^="title-link"]'`, but as mentioned before, you will need backend code to actually save the file. — Johan, Aug 11 '15 at 10:40
Is the class "title-link__title-text" permanent for all titles? — Kushal, Aug 11 '15 at 10:41
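
Following Lee Taylor's suggestion, a feed is usually easier to parse than the page markup, because each story sits in its own `<item>` with a `<title>`. A minimal sketch, assuming a standard RSS 2.0 layout; `firstItemTitle` and the sample XML are made up for illustration, and the real feed URL would have to be taken from the BBC page linked above:

```javascript
// Hypothetical helper: returns the title of the first <item> in an RSS 2.0
// document, or null when there is none. Regex parsing is only a sketch;
// a real XML parser is more robust.
function firstItemTitle(rssXml) {
  var item = /<item[\s>][\s\S]*?<\/item>/i.exec(rssXml);
  if (!item) {
    return null;
  }
  var title = /<title>\s*(?:<!\[CDATA\[)?([\s\S]*?)(?:\]\]>)?\s*<\/title>/i.exec(item[0]);
  return title ? title[1].trim() : null;
}

// Made-up sample input, not real feed data.
var sampleFeed =
  "<rss><channel><title>Example feed</title>" +
  "<item><title><![CDATA[First headline]]></title></item>" +
  "<item><title>Second headline</title></item>" +
  "</channel></rss>";

console.log(firstItemTitle(sampleFeed));
```

Restricting the search to the first `<item>` matters because the channel itself also carries a `<title>`, which is the feed name rather than a headline.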

score 1 · Accepted Answer · edited May 23 '17 at 12:21

Pure client Browser approach:

OK, I made this fiddle for you; it may help others too. This was an interesting challenge. Here is how I arrived at a possible solution:

Used the Blob API (a browser API, part of the File API rather than ECMAScript 5 itself) to create the text file on the fly.
Loaded http://www.bbc.co.uk/news in an iframe (cross-origin access; see the Note section below).
On the iframe's load event, start a timer with setTimeout, or with setInterval (commented out) for repeated execution hourly or daily. Adjust the delay to your needs.
Queried the headline elements with querySelectorAll(".title-link span") on the iframe's document; the selector seemed generic enough after examining the page source.
Check out the fiddle link.

Javascript:

 (function () {
    var textFile = null,
        makeTextFile = function (text) {
            var data = new Blob([text], {
                type: 'text/plain'
            });

            // If we are replacing a previously generated file we need to
            // manually revoke the object URL to avoid memory leaks.
            if (textFile !== null) {
                window.URL.revokeObjectURL(textFile);
            }

            textFile = window.URL.createObjectURL(data);

            return textFile;
        };

    var iframe = document.getElementById('frame');    
    var commFunc = function () {
            var iframe2 = document.getElementById('frame'); //This is required to get the fresh updated DOM
            var innerDoc = iframe2.contentDocument || iframe2.contentWindow.document;            
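            var getAll = Array.prototype.slice.call(innerDoc.querySelectorAll(".title-link span"));
            var dummy = "";
            // Index loop: for...in on an array walks property names, not values.
            // textContent is used because older Firefox versions lack innerText.
            for (var i = 0; i < getAll.length; i++) {
                dummy = dummy.concat("\n" + getAll[i].textContent);
            }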
            var link = document.createElement("a");
            link.href = makeTextFile(dummy);
            link.download = "sample.txt";
            link.click();
            console.log("Downloaded the sample.txt file");
        };

    iframe.onload = function () {
        setTimeout(commFunc, 1000); //Adjust the time required to load
        //setInterval(commFunc, 1000);
    };  

    //Click the button once the page inside the iframe has loaded
    document.getElementById('create').addEventListener('click', commFunc);
})();

HTML:

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>
    <div>
        <iframe id="frame" src="http://www.bbc.co.uk/news"></iframe>
    </div>
    <button id="create">Download</button>

Note:

To run the above JavaScript in Chrome you need to disable web security, because the script reads the document of a cross-origin iframe. The script should run fine in Firefox with no tweaks needed.
This illustrates what can be achieved with pure browser scripting. The tab must stay open and active for periodic grabbing.
Targeted at modern browsers.
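
The setInterval line commented out in the script fires on a fixed period counted from page load. For a grab at a set time each day, one option is to compute the delay to the next occurrence and re-arm a setTimeout after each run. A sketch only: `msUntilNext` and `scheduleDaily` are illustrative names, and the 07:00 run time in the usage comment is an arbitrary assumption.

```javascript
// Milliseconds from `now` until the next local hour:minute.
// If that time has already passed today, tomorrow's occurrence is used.
function msUntilNext(hour, minute, now) {
  var next = new Date(now.getTime());
  next.setHours(hour, minute, 0, 0);
  if (next.getTime() <= now.getTime()) {
    next.setDate(next.getDate() + 1);
  }
  return next.getTime() - now.getTime();
}

// Runs `task` at the next hour:minute, then re-arms itself for the day after.
function scheduleDaily(hour, minute, task) {
  setTimeout(function () {
    task();
    scheduleDaily(hour, minute, task);
  }, msUntilNext(hour, minute, new Date()));
}

// Usage (not executed here): scheduleDaily(7, 0, commFunc);
```

The timer only exists while the page is open, so this does not remove the need to keep the tab active.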

Suggested Approach:

Use a Node.js server; you can modify the above script to run standalone.
Or use any server-side scripting framework, such as PHP or Java Spring.

Using Node js approach:

Javascript:

var jsdom = require("node-jsdom");
var fs = require("fs");
var os = require("os");

jsdom.env({
  url: "http://www.bbc.co.uk/news",
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    // Stop if the page or the injected jQuery script failed to load.
    if (!window || !window.$) {
      console.error("Could not load the page:", errors);
      return;
    }
    var $ = window.$;
    $(".title-link span").each(function () {
      // appendFileSync creates sample.txt if it does not exist yet and
      // writes synchronously, so the headlines stay in page order.
      fs.appendFileSync("sample.txt", $(this).text() + os.EOL);
    });
    console.log("Appended the headlines to sample.txt");
  }
});

Dependencies for the above code:

NodeJS
JSDOM
Jquery
NodeJS Filesystem
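
Since the goal is a daily record, it may help to write each run under a date heading so that one file can hold many days. A minimal sketch using only Node's built-in modules; `formatDailyEntry`, `appendDailyEntry`, the demo file name, and the headlines are all made up for illustration:

```javascript
var fs = require("fs");
var os = require("os");
var path = require("path");

// Builds one block of text: a YYYY-MM-DD heading (UTC date) followed by
// one headline per line and a blank line to separate it from the next day.
function formatDailyEntry(headlines, date) {
  var stamp = date.toISOString().slice(0, 10);
  return [stamp].concat(headlines).join(os.EOL) + os.EOL + os.EOL;
}

// Appends the block to `file`; appendFileSync creates the file if it is missing.
function appendDailyEntry(file, headlines, date) {
  fs.appendFileSync(file, formatDailyEntry(headlines, date));
}

// Example with made-up headlines, written to the system temp directory.
var demoFile = path.join(os.tmpdir(), "headlines-demo.txt");
appendDailyEntry(demoFile, ["First headline", "Second headline"], new Date());
console.log("Wrote " + demoFile);
```

In the jsdom callback, the headline texts could be collected into an array first and passed to `appendDailyEntry` in one call.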

Hope this helps you and others too.

WOW! This is great! Thank you. Just one small problem. When I paste the JavaScript code into a .js file, it says that 'document' is undefined. (Line 18, Char 5. Error Code: 800A1391) At the risk of sounding silly, how can I fix that? — Master-chip, Aug 11 '15 at 13:31
Can you post a fiddle link for what you are trying to do? Did you take a look at my fiddle link solution? — Nirus, Aug 11 '15 at 13:51
Yeah, I saw your fiddle link. I tried to copy your JavaScript code into Notepad and run it under a .js extension so that it would be independent and offline. That's when I got the error code. Then I reread your answer and saw that you suggested installing Node.js, which I did. Following that, I ran your .js script again with it, and it stopped when it reached document, saying "ReferenceError: document is not defined." — Master-chip, Aug 11 '15 at 14:17
Aaah, OK. The above code works in the browser only. As explained under Suggested Approach, you can't run the above snippet directly on Node.js, because the document object is not available there (I hope you are not a newbie). You need to adapt the code so that it will run on Node.js: for example, use the filesystem module in place of the Blob API, use https://github.com/tmpvar/jsdom for the document model, and so on. — Nirus, Aug 11 '15 at 14:30
@Master-chip based on your request i have updated the answer. — Nirus, Aug 11 '15 at 15:03
Actually this is my first post and I've never programmed in javascript before. I don't see myself as a newbie to programming though. Thanks for all of your advice, this should come to life pretty soon. — Master-chip, Aug 11 '15 at 15:34

score 0 · Answer 2 · answered Aug 11 '15 at 11:23

My thoughts -

JS can be used to get data/text from pages, but to save it into a file you have to use something on the backend, such as Python or PHP.
Why use JS? You can scrape the web very well using cURL. Use PHP cURL if that's easier for you.

You can scrape/download the webpage using:

<?php
    // Defining the basic cURL function
    function curl($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>

Then use the function as needed:

<?php
    $scraped_website = curl("http://www.yahoo.com");  // Executing our curl function to scrape the webpage http://www.yahoo.com and return the results into the $scraped_website variable
?>
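
For comparison, roughly the same download step can be written in JavaScript for Node.js. This is only a sketch: it assumes Node 18 or later, where `fetch` is built in, and `scrapePage` and its `fetchImpl` parameter are illustrative names, not part of any library.

```javascript
// Downloads a page and resolves with its body as text.
// `fetchImpl` exists only so the function can be exercised without a network;
// when it is omitted, the global fetch is used (assumption: Node 18+).
function scrapePage(url, fetchImpl) {
  var doFetch = fetchImpl || fetch;
  return doFetch(url).then(function (response) {
    if (!response.ok) {
      throw new Error("Request failed with status " + response.status);
    }
    return response.text();
  });
}

// Usage (not executed here):
// scrapePage("http://www.bbc.co.uk/news").then(function (html) {
//   console.log(html.length + " characters downloaded");
// });
```

As with the PHP function, this only returns the raw HTML; picking out the headline is a separate step.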

Reference Links-

Web scraping with PHP and CURL

Scraping in PHP with CURL

You can scrape more precisely by targeting the DIVs and nodes of the HTML elements. Check these out - Part1 - Part2 - Part3

Hope it helps. Happy Coding!

score -1 · Answer 3 · edited May 23 '17 at 11:44

You want to download a text file with content from the HTML, is that right? You can use this to create a text file and download it. If you want to get the text from all the title spans, you need to do this:

var txt = "";
var nodeList = document.querySelectorAll(".title-link__title-text");
for (var i = 0; i < nodeList.length; i++) {
   txt += "\n" + nodeList[i].innerText;
}

Then write the txt variable to a file, as in the post I mentioned above.
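
Putting the two steps together, as a hedged sketch: `collectText` gathers the headline text and `downloadText` saves it through a Blob and a temporary link, the same technique the accepted answer uses. Both function names and the `headlines.txt` file name are placeholders, and `downloadText` only works in a browser.

```javascript
// Joins the text of every node, one per line. textContent is the fallback
// because older Firefox versions do not support innerText.
function collectText(nodes) {
  var lines = [];
  for (var i = 0; i < nodes.length; i++) {
    lines.push((nodes[i].innerText || nodes[i].textContent || "").trim());
  }
  return lines.join("\n");
}

// Browser only: offers `text` as a download named `fileName`.
function downloadText(text, fileName) {
  var url = URL.createObjectURL(new Blob([text], { type: "text/plain" }));
  var link = document.createElement("a");
  link.href = url;
  link.download = fileName;
  document.body.appendChild(link); // some browsers ignore clicks on detached links
  link.click();
  document.body.removeChild(link);
  // Release the object URL a little later so the download can start first.
  setTimeout(function () {
    URL.revokeObjectURL(url);
  }, 1000);
}

// Usage in the page (not executed here):
// downloadText(collectText(document.querySelectorAll(".title-link__title-text")), "headlines.txt");
```

The browser decides where the file is saved, so this produces a download prompt or a file in the downloads folder rather than writing to a chosen path.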
