OP says:
> they can sometimes be large up-to 20mb
Since the volume of data you serve can be pretty large, I think it is feasible for you to do this in 2 requests instead of one, decoupling content generation from content serving. The reason to do this is to minimize the time and resources the server spends fetching data from S3 and relaying it to the client.
AWS supports pre-signed URLs, which can be made valid for only a short amount of time; we can use them here so clients can fetch directly from S3 without opening up security issues.
Currently, your architecture looks something like the diagram below: the client initiates a request, you check whether the requested data exists on S3, and fetch and serve it if it does; otherwise you generate the content, save it to S3, and serve it:
```
                 if exists on S3
client --------> server --------------------> fetch from S3 and serve
                    |
                    | else
                    |------> generate content -------> save to S3 and serve
```
In terms of network resources, you always consume 2X the bandwidth and time here. If the data exists, you pull it from S3 to the server and then serve it to the customer (so it is 2X). If the data doesn't exist, you send the generated content both to the customer and to S3 (so again it is 2X).
Instead, you can try the two approaches below. Both assume that you have some base template and that the remaining data can be fetched via AJAX calls, and both bring down that 2X factor in the overall architecture.
**Approach 1: serve the content from S3 only.** This calls for changes to the way your product is designed, and hence may not be that easy to integrate.
Basically, for every incoming request, return the S3 URL if the data already exists; otherwise create a task for it in SQS, generate the data, and push it to S3. Based on your usage patterns for different artists, you should have an estimate of how long it takes to pull the data together on average, so return a URL that will be valid within the estimated time to completion, `T`, of the task.

The client waits for time `T`, then makes the request to the URL returned earlier, retrying up to, say, 3 times on failure. In fact, the data already existing on S3 can be thought of as the base case where `T = 0`.
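The client-side wait-and-poll loop might be sketched like this, using the standard library's `urllib` as a stand-in for whatever HTTP client you actually use (the retry count and backoff are illustrative):

```python
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, wait_seconds, max_attempts=3, backoff=2.0):
    """Wait the estimated time T, then poll the pre-signed URL."""
    time.sleep(wait_seconds)  # T = 0 when the data was already on S3
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff between tries
```

The client would call `fetch_with_retries(presigned_url, wait_seconds=T)` with the `T` returned by the server.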
In this case, the client makes 2-4 network requests, but only the first of them hits your server. You transmit the data to S3 once, only when it doesn't already exist, and the client always pulls it from S3.
```
                 if exists on S3, return URL
client --------> server --------------------------------> S3
                    |
                    | else SQS task
                    |---------------> generate content -------> save to S3
                                      return pre-computed URL

       wait for time `T`
client -------------------------> S3
```
**Approach 2: check whether the data already exists, and make the second network call accordingly.**
This is similar to what you currently do when serving data from the server in the case where it doesn't already exist. Again we make 2 requests here; however, this time the server serves the data synchronously when it doesn't exist.
So, on the first hit, we check whether the content has ever been generated before; if so, we get back a URL, otherwise an error message. When successful, the next hit goes to S3.
If the data doesn't exist on S3, the client makes a fresh request (to a different POST URL); on receiving it, the server computes the data and serves it, while adding an asynchronous task to push it to S3.
```
                 if exists on S3, return URL
client --------> server --------------------------------> S3

client --------> server ---------> generate content -------> serve it
                                        |
                                        |---> add SQS task to push to S3
```