4

At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.

My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It's currently running very slowly, taking up to 2 minutes (with a entirely white screen through it all, no less) to load completely.

Profiling the code with xdebug, I found a single preg_replace call that takes around 30 seconds to complete, which currently goes through all the HTML code and replaces each URL found to its SEO-friendly version. The moment it completes, it outputs all of the code to the browser. (As I said before, that's not the only issue -the code is rather old, and it shows-, but I'll focus on it for this question.)

Digging further into the code, I found that it currently looks through 1702 patterns with each appropriate match (both matches and replacements in equally-sized arrays), which would certainly account for the time it takes.

Code goes like this:

//This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:   
$seourls_data = $oSeoShared->getSeourls();

$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data)
{
    if ($seourl_data["url"])
    {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}

//After filling both $url_masks and $seourls arrays, then the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
//After it completes, $html_seo is simply echo'ed to the browser.

Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how to solve this particular issue? My first attempt would probably be:

  1. Load the (hopefully, well-formed) HTML into a DOMDocument, and then get each href attribute in each a tag, like so.
  2. Go through each node, replacing the URL found for its appropriate match (which would probably mean using the previous regexps anyway, but on a much-reduced-size string)
  3. ???
  4. Profit?

but I think it's most likely not the right way to solve the issue. Any ideas or suggestions?

Thanks.

Community
  • 1
  • 1
kenshin23
  • 147
  • 10
  • 1
    I would say, don't put bad urls in the html to begin with, put the good ones. Rather than this very slow way. – Josh Apr 24 '14 at 23:34
  • 3
    As it stands, the site is unusable. So, the first thing I'd do is to create a new column in the database, move the filtering stuff to an offline process, and record the results in the new column. Then the site will at least work, and you can get to fixing the implementation. – halfer Apr 24 '14 at 23:35
  • 1
    `that's not the only issue -the code is rather old, and it shows`, time for a rewrite then. Unfortunately there's no fast way todo what you want, unless you fix the urls in your content, regex-ing 1702 patterns is a little bit of a concern. – Lawrence Cherone Apr 24 '14 at 23:37
  • @Josh makes a good point though - can you not optimise the contents of the content database to begin with? (Aside: this question might be a touch discursive, but it's detailed and involves code, so I'll leave it alone for now. +1). – halfer Apr 24 '14 at 23:38
  • +1 on one time fix to HTML content in database. It is fine to run this script once and have it take 2+ minutes to update the content in the database. It is totally crazy to do this with every page load. – Mike Brant Apr 25 '14 at 00:16
  • I'm wondering how the regexes look like, I'm certain that the one who wrote it isn't really fluent in it: `[\"'\>\s]{1}` could be simplified to `[\"'>\s]` and `[\#|\"'\s]{1}` to `[#|\"'\s]` – HamZa Apr 25 '14 at 00:33
  • To put this simply, you need to change your method. You aren't going to have a solution that is doing thousands of URL replacements on every page load, and run quickly. It's not going to happen, this is simply an inefficient way to go about this. I would recommend ask yourself WHY these URLs are being replaced on every page load, and if it's really necessary. If you're already using smarty templating, it sounds like something that can be resolved via a dynamic variable and restructuring of the HTML. Please provide more details of the URL replacements. – rocksfrow Apr 25 '14 at 12:53
  • 2
    Thanks for all the comments (and sorry for the long-winded post, just thought I'd make it as clear as possible.) Yes, a complete site rewrite is in progress (using Erlang/ChicagoBoss instead) but since it's not going to be ready soon enough, the old site was put online for the moment. Unfortunately, it's too much of an effort to put the site in a working state, so my employer has told me to forget about it. Great chance to learn about how (not) to code, thougn. – kenshin23 Apr 25 '14 at 18:17
  • 1
    Maybe the function be optimized, did you identify the time taken by the execution of a regex. – Shire Apr 30 '14 at 03:13
  • Btw, for correctness, it's best to use `preg_quote()` instead of `str_replace()` to fix only the expression delimiters. – Ja͢ck May 02 '14 at 15:21
  • See [this answer](http://stackoverflow.com/a/4234491/471272). – tchrist Jun 06 '14 at 22:39

2 Answers2

0

As your goal is to be SEO-friendly, using canonical tag in the target pages would tell the search engines to use your SEO-friendly urls, so you don't need to replace them in your code...

Sly
  • 364
  • 2
  • 9
0

Oops ,That's really tough, bad strategy from the beginning , any way that's not your fault,
i have 2 suggestion:-
1-create a caching technique by smarty so , first HTML still generated in 2 min >
second HTMl just get from a static resource .

2- Don't Do what have to be done earlier later , so fix the system ,create a database migration that store the SEO url in a good format or generate it using titles or what ever, on my system i generate SEO links in this format ..

www.whatever.com/jobs/722/drupal-php-developer where i use 722 as Id by parsing the url to get the right page content and (drupal-php-developer) is the title of the post or what ever

3 - ( which is not a suggestion) tell your client that project is not well engineered (if you truly believe so ) and need a re structure to boost performance . run

Shady Keshk
  • 428
  • 5
  • 13