0

So... I have a HTML file with a table containing the data of about... 2k people. These data include (and are not restricted to) address, email, telephone, bank accounts, etc.

I need to throw all these data into a MySQL database. So far, my attempts were:

1) Place it all within a form tag and attempt to (jquery-)serialize it. Too big.

2) Find a pattern in the HTML structure so I can:

a) Write thousands of SQL insert queries or

b) Receive it within a server-side program.

Didn't find a pattern.

3) Simply access the containing database. Nope. All the data is here. Database is gone. Gone.

The file is about 46MB, and contains 1,086,177 lines altogether. There are blank lines within these. Because of this, simply attempting to open the file in Notepad++ takes about 5 seconds, running it in Chrome eats about 400MB RAM, and even VIM takes a while to run over it.

It's not quite that difficult, actually. I mean, there is a

<table align="center" width="100%">
   <tbody>

on top of it... Problem is that all data, independently what or whom's, is merely separated by endless rows, such as

 <tr>
  <td nowrap="" align="right" valign="baseline"> 
    <div align="right"><font face="Verdana, Arial, Helvetica, sans-serif" size="1"><font color="#FF0000">*</font> 
      Email:</font></div>
  </td>
  <td nowrap="" align="left" valign="baseline"><font face="Verdana, Arial, Helvetica, sans-serif" size="2" color="#0000CC"><b> 
    <font color="#990000">email</font></b></font> </td>
  <td nowrap="" align="right" valign="baseline"><font face="Verdana, Arial, Helvetica, sans-serif" size="1"> 
    __Password__: </font></td>
  <td nowrap="" align="right" valign="baseline"> 
    <div align="left"> 
      **** 
    </div>
  </td>
  </tr>

Any, any, any ideas at all are truly welcome.

EDIT: To make things clearer: I need to find a pattern. VIM/Notepad++ Find/Replace expressions to create something I can throw into a MySQL database somehow - includes SQL scripts, post-querystrings,etc.

Eric Wu
  • 751
  • 7
  • 29
  • what about opening it using MS tools, such as Excel or Access. If they can format the data you can export from there – cha Jul 03 '14 at 04:23
  • You don't need a regex pattern. You need a program/script that loads the HTML into a DOM, iterates the table rows and fills values into a prepared SQL statement. If the HTML table structure is really simple you could also use Excel/OpenOffice Calc to convert it to CSV and load it with [mysqlimport](http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html). – Tomalak Jul 03 '14 at 05:44

1 Answers1

0

It turns out that you can simplify things. First, open it in some better text editor like Notepad++, and replace the tabs, consecutive spaces, and newlines with empty strings. This should reduce the size of the file by a tiny bit.

Then open it in the browser, and run this code snippet in the console:

var cells = document.getElementsByTagName('td')
for (i = 0; i < cells.length; i++) {
    cells[i].innerHTML = cells[i].innerText
}

I ran this with your supplied HTML, and the resulting DOM (acquired throug document.body.innerHTML) was this:

<table align="center" width="100%">
   <tbody>
  <tr>
  <td nowrap="" align="right" valign="baseline">* Email:
</td>
  <td nowrap="" align="left" valign="baseline">email</td>
  <td nowrap="" align="right" valign="baseline">__Password__:</td>
  <td nowrap="" align="right" valign="baseline">****</td>
  </tr>
  </tbody>
</table>

As you can see, the unnecessary <font>, and <div> tags are no longer there. Now if you disregard the attributes, you have a table with cells that has a pattern:

<td>name</td>
<td>value</td>
<td>name</td>
<td>value</td>

[Edit START]

Okay, here's another script to further simplify things:

var cells = document.getElementsByTagName('td');
for (i = 0; i < cells.length; i++) {
    cells[i].outerHTML = '<td>' + cells[i].innerHTML + '</td>';
}

This simplifies the DOM to following:

<table align="center" width="100%">
   <tbody>
  <tr>
  <td>* Email:
</td>
  <td>email</td>
  <td>__Password__:</td>
  <td>****</td>
  </tr>
  </tbody>
</table>

[Edit END]

So these are name value pairs.

You can copy the HTML to clipboard using the snippets provided here: How do I copy to the clipboard in JavaScript?

Once copied, you can save it to a file. If this too is unusable for data extraction, you can continue simplifying the document using simple scripts one-at-a-time in this fashion.

Community
  • 1
  • 1
sampathsris
  • 19,015
  • 10
  • 59
  • 90