5

I'm making a web scraper, and most of the data on the web page is in JavaScript object literal form, e.g.:

// Silly example
var user = {
    name: 'John', 
    surname: 'Doe',
    age: 21,
    family: [
        {
            name: 'Jane',
            surname: 'Doe',
            age: 37
        },
        // ...
    ]
};

So when I extract the script contents in my JavaScript app, the object above comes out as a string:

"{name: 'John', surname: 'Doe', age: 21, family: [{name: 'Jane', surname: 'Doe', age: 37}]}"

Is it possible to parse those into regular JavaScript objects without using eval or writing my own parser? I have seen similar questions, but the answers are not applicable: they all suggest JSON.parse() (which fails on this input) or eval (which I can't use for security reasons). In this question, for example, all the answers suggest eval or new Function(), which are basically the same thing.

If there is no other way, would it be viable to convert the literal to proper JSON and then parse it into a JavaScript object?

This is what I have tried so far. It worked on a simple object, but I'm not sure it will work everywhere:

const literal = script.innerText.slice(script.innerText.indexOf('{'), script.innerText.lastIndexOf('}') + 1);
const json = literal.replace(/.*:.*(\".*\"|\'.*\'|\[.*\]|\{.*\}|true|false|[0-9]+).*,/g, (prev) => {
  let parts = prev.split(':');
  let key = '"' + parts.shift().trim() + '"';
  let value = parts.join(':').replace(/'.*'/, (a) => {
    return '"' + a.slice(1, a.length - 1) + '"';
  }).trim();
  return key + ':' + value;
});
const obj = JSON.parse(json);
Fr3ddyDev
  • 4
    An object literal is a JS object. If you need a JSON string, then see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON – Teemu Jun 18 '19 at 12:24
  • 2
    @Teemu but based on the statement that OP is using a web scraper, I'd assume that he has it in string format. I'd be interested in why it isn't possible to use JSON.parse – Tom M Jun 18 '19 at 12:25
  • if it is in a file, you can load JS https://stackoverflow.com/questions/14521108/dynamically-load-js-inside-js – Slai Jun 18 '19 at 12:26
  • Is this between script tags in an HTML document, or an included file? – Lewis Jun 18 '19 at 12:27
  • 1
    You could use a JavaScript parser like [esprima](http://esprima.org/) or [acorn](https://github.com/acornjs/acorn) – ponury-kostek Jun 18 '19 at 12:28
  • 2
    @TomM It looks like I'm not reading questions carefully enough nowadays. Yes, the provided example is most likely a string, but it is not a JSON string, which is why JSON.parse can't parse it. – Teemu Jun 18 '19 at 12:28
  • 1
    @TomM `I'd be interested in why it isn't possible to use JSON.parse` Because even if you take away `var user =` and the trailing `;`, it's not valid JSON. E.g. `{hello: 'there'}` is not valid JSON. – Keith Jun 18 '19 at 12:29
  • 1
    @TomM `JSON.parse()` is not usable because the object literal shown is valid JavaScript but not valid JSON – Fr3ddyDev Jun 18 '19 at 12:29
  • @Fr3ddyDev could you post an example of how you are currently trying to consume the string inside your app? – Tom M Jun 18 '19 at 12:29
  • @Lewis between script tags; the problem is not extracting the object string but parsing it into a JavaScript variable – Fr3ddyDev Jun 18 '19 at 12:30
  • So your option is to write a parser or write code to convert it to valid JSON. – epascarello Jun 18 '19 at 12:31
  • @Fr3ddyDev Fair enough, knowing that gives us some context though. – Lewis Jun 18 '19 at 12:31
  • Obligatory note that parsing strings with `eval(…)`, `require(…)`, `new Function(…)`, `document.createElement('script')`, et al, will cause the code in the string to be executed, which is probably dangerous if the string comes from outside your program. – Quinn Comendant Jun 29 '20 at 17:32
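
The point made in the comments above, that the literal is valid JavaScript but not valid JSON, can be checked directly. This is only a small sketch; `tryJsonParse` is a made-up helper name, and returning `null` on failure is a simplification chosen for the demo (it cannot distinguish a failed parse from the JSON text `null`).

```javascript
// Hypothetical helper: returns the parsed value, or null if the text
// is not valid JSON. Any non-syntax error is rethrown unchanged.
function tryJsonParse(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    if (err instanceof SyntaxError) return null;
    throw err;
  }
}

// Valid JavaScript object literal, but not JSON:
// the key is unquoted and the string uses single quotes.
const jsLiteral = "{hello: 'there'}";

// The same data written as strict JSON.
const strictJson = '{"hello": "there"}';

console.log(tryJsonParse(jsLiteral));  // null
console.log(tryJsonParse(strictJson)); // { hello: 'there' }
```

JSON requires double-quoted keys and strings, so any conversion step has to handle both before `JSON.parse` will accept the text.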

4 Answers

3

Here is a simple demo of how you can use esprima to get globally declared variables:

"use strict";

const src = `
var user = {
 name: 'John',
 surname: 'Doe',
 age: 21,
 family: [
  {
   name: 'Jane',
   surname: 'Doe',
   age: 37
  },
  // ...
 ]
};`;
const src2 = `
var a = [1,2,3], b = true;
var s = "some string";
var o = {a:1}, n = null;
var some = {'realy strange' : {"object":"'literal'"}}
`;

function get_globals(src) {
  return esprima.parse(src).body
    .filter(({type}) => type === "VariableDeclaration") // keep only variable declarations
    .map(({declarations}) => declarations)
    .flat()
    .filter(({type}) => type === "VariableDeclarator")
    .reduce((vars, {id: {name}, init}) => {
      // A declaration without an initializer (e.g. `var x;`) has init === null
      vars[name] = init ? parse(init) : undefined;
      return vars;
    }, {});
}

console.log(get_globals(src));
console.log(get_globals(src2));

/**
 * Parse expression
 * @param expression
 * @returns {*}
 */
function parse(expression) {
  switch (expression.type) {
    case "ObjectExpression":
      return ObjectExpression(expression);
    case "Identifier":
      return expression.name;
    case "Literal":
      return expression.value;
    case "ArrayExpression":
      return ArrayExpression(expression);
  }
}

/**
 * Parse object expression
 * @param expression
 * @returns {object}
 */
function ObjectExpression(expression) {
  return expression.properties.reduce((obj, {key, value}) => ({
    ...obj,
    [parse(key)]: parse(value)
  }), {});
}

/**
 * Parse array expression
 * @param expression
 * @returns {*[]}
 */
function ArrayExpression(expression) {
  return expression.elements.map((exp) => parse(exp));
}
<script src="https://unpkg.com/esprima@~4.0/dist/esprima.js"></script>
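
One gap worth knowing about: in the ESTree format that esprima produces, a negative number such as `-1` is a `UnaryExpression` wrapping a `Literal`, so the `parse` function above returns `undefined` for it, and it silently does the same for any other node type it does not list. The following is a hedged sketch of a stricter walker, not part of the original answer. The name `evaluate` is made up, and the example feeds it a hand-built AST so that it runs without esprima; with esprima you would pass it the `init` node of a declarator.

```javascript
// Hypothetical helper: turn an ESTree expression node into a plain value,
// throwing on anything that is not pure data.
function evaluate(node) {
  switch (node.type) {
    case "Literal":
      return node.value;
    case "UnaryExpression":
      // ESTree represents -1 as UnaryExpression("-", Literal(1)).
      if ((node.operator === "-" || node.operator === "+") &&
          node.argument.type === "Literal" &&
          typeof node.argument.value === "number") {
        return node.operator === "-" ? -node.argument.value : node.argument.value;
      }
      break;
    case "ArrayExpression":
      // Holes in sparse arrays appear as null elements.
      return node.elements.map((el) => (el === null ? undefined : evaluate(el)));
    case "ObjectExpression": {
      const result = {};
      for (const prop of node.properties) {
        // Reject getters/setters, computed keys and spread elements.
        if (prop.type !== "Property" || prop.computed || prop.kind !== "init") {
          throw new SyntaxError("Unsupported property in object literal");
        }
        const key = prop.key.type === "Identifier"
          ? prop.key.name
          : String(prop.key.value);
        // defineProperty avoids triggering the __proto__ setter on scraped keys.
        Object.defineProperty(result, key, {
          value: evaluate(prop.value),
          enumerable: true, writable: true, configurable: true,
        });
      }
      return result;
    }
  }
  throw new SyntaxError("Unsupported node type: " + node.type);
}

// Hand-built AST for: {balance: -5, 'tags': ['a']}
const ast = {
  type: "ObjectExpression",
  properties: [
    { type: "Property", kind: "init", computed: false,
      key: { type: "Identifier", name: "balance" },
      value: { type: "UnaryExpression", operator: "-",
               argument: { type: "Literal", value: 5 } } },
    { type: "Property", kind: "init", computed: false,
      key: { type: "Literal", value: "tags" },
      value: { type: "ArrayExpression",
               elements: [{ type: "Literal", value: "a" }] } },
  ],
};

console.log(evaluate(ast));
```

Throwing on unknown nodes means that function calls, identifiers used as values, and similar constructs in scraped code surface as errors instead of being turned into `undefined` or executed.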
ponury-kostek
2

For data like this, you might be able to use a couple of regexes to convert it into valid JSON.

Below is an example.

P.S. It might not be 100% foolproof for all object literals.

var str = "{name: 'John', surname: 'Doe', age: 21, family: [{name: 'Jane', surname: 'Doe', age: 37}]}";

var jstr = str
  .replace(/\'(.*?)\'/g, '"$1"')
  .replace(/([\{ ]*?)([a-z]*?)(\:)/gi, '$1"$2"$3');

var obj = JSON.parse(jstr);

console.log(obj);
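
To make the "not 100% foolproof" caveat concrete, here is a small sketch of one input that trips the two replacements: a string value that itself contains a colon. The wrapper name `regexToJson` is made up for this example; the regexes are copied unchanged from the snippet above.

```javascript
// Same two replacements as in the snippet above, wrapped in a function
// (the name regexToJson is hypothetical).
function regexToJson(str) {
  return str
    .replace(/\'(.*?)\'/g, '"$1"')
    .replace(/([\{ ]*?)([a-z]*?)(\:)/gi, '$1"$2"$3');
}

// A value that ends with a colon looks like a key to the second regex.
const tricky = "{name: 'John:'}";
const converted = regexToJson(tricky);
console.log(converted); // {"name": ""John":"}

let parsed = null;
try {
  parsed = JSON.parse(converted);
} catch (err) {
  // The extra quotes around John make the text invalid JSON.
  console.log("Not valid JSON:", err.message);
}
```

The second regex has no notion of being inside a string, so `John:` gets quoted as if it were a property name.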

As pointed out by @ponury-kostek, and as I mentioned myself, using regex can be limiting. Using some sort of AST parser like Esprima is certainly a good idea, especially if you're already using one.

But if an AST parser is overkill, the more robust version below, written in plain JavaScript, might be better. Again, it might not be 100% correct, but it should cope with the majority of object literals.

var str = `{
  name: 'John:', surname: 'Doe', age: 21,
  family: [
    {name: 'Jane\\n\\r', surname: 'Doe', age: 37},
    {'realy:strange indeed' : {"object":"'\\"literal'"}}
  ]
}`;


const objLits = [...':0123456789, \t[]{}\r\n'];

function objParse(src) {
  const input = [...src];
  const output = [];
  let inQ = false, inDQ = false, 
    inEsc = false, inVname = false;
  for (const i of input) {
    if (inEsc) {
      inEsc = false;    
      output.push(i);
    } else if (i === "\\") {
      inEsc = true;
      output.push(i);
    } else if (i === "'" && !inDQ) {
      output.push('"');
      inQ = !inQ;
    } else if (i === '"' && !inQ) {
      output.push('"');
      inDQ = !inDQ;      
    } else if (!inVname && !inQ && !inDQ && !inEsc) {
      if (objLits.includes(i)) {
        output.push(i);
      } else {
        inVname = true;
        output.push('"');
        output.push(i);
      }
    } else if (inVname) {
      if (i === ':') {
        inVname = false;
        output.push('"');
      }
      output.push(i);
    } else {
      output.push(i);
    }
  }
  const ostr = output.join('');
  return JSON.parse(ostr);
}

console.log(objParse(str));
Keith
  • This will crash if you use quoted keys or if you use quotes in values (escaped), try to parse something like `var some = {'realy strange' : {"object":"'literal'"}}` – ponury-kostek Jun 18 '19 at 13:47
  • @ponury-kostek Yes, I did mention it wouldn't be 100%. I've updated with a JS version that should be a bit more robust. PS. I use AST parsers too, and I think that's the most robust way of doing this. But for something quick and dirty I'll leave these versions here. – Keith Jun 19 '19 at 14:55
  • OP mentioned that he is making a web scraper, which is why I suggested using a JS parser – ponury-kostek Jun 19 '19 at 16:48
0

Assuming you use Node, an easy workaround would be:

// scraper.js
const fs = require('fs');
const objectString = myScraper.scrape('example.com');

fs.writeFileSync('./scraped.js', objectString);

// myAppUsingTheData.js
// Load the file written by scraper.js (it must assign the object to module.exports)
const myObj = require('./scraped');

However, require still evaluates the contents of the file. You would also need separate processes in order to access your object, and you would need to somehow insert module.exports into the written file. If you only want to parse objects, try JSON5:

const myObj = JSON5.parse(objectString);
console.log(myObj.name)

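Because JSON5 parses its input as data and never executes it, malicious code in the string cannot run in your app: anything that is not a valid JSON5 value is rejected with an error. It also accepts unquoted keys and single-quoted strings, which plain JSON does not.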

Tom M
-1

A script tag containing the script text can be added to the page:

var JS = `var user = {
    name: 'John', 
    surname: 'Doe',
    age: 21,
    family: [
        {
            name: 'Jane',
            surname: 'Doe',
            age: 37
        },
    ]
};`;

var script = document.createElement('script');
script.textContent = JS
document.head.appendChild(script);

console.log( user )
Slai