This is how we handle it currently.

> User adds images to a virtual lightbox.
> User decides that he wants to download all the images in this lightbox, so he presses "Download Folder". The user is then presented with a list of possible dimensions that they can request.
> The user selects "Large" and "Small" and hits "Download"
> This request gets added to our Gearman job queue.
> The job gets handled and all the files are downloaded from Amazon S3 to a temporary location on the local file server.
> A Zip object is then created and each file is added to the Zip file.
> Once complete, the file is then uploaded back to Amazon S3 in a custom "archives" bucket.
> Before this batch job finishes, I fire off a message to Socket.io / Pusher which sends the URL back to the client who has been waiting patiently for X minutes while his job has been processing.
This works okay for us because when users create "Archives" of their lightboxes, they generally do this because they want to share the files with other people. This means they attach the URL to emails to send to other people.
So for us, it's actually necessary to save the file back to S3... however, I'm sure that not everyone needs to share the file... it would definitely be worth investigating whether the user plans to return to the archive, in which case implementing streams could potentially save us on storage and complexity.
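For anyone curious, the whole job boils down to something like this (a minimal stdlib-only Python sketch; `fetch`, `store`, and `notify` are hypothetical stand-ins for the S3 download, the upload to the "archives" bucket, and the Socket.io / Pusher message, none of which are shown):

```python
import os
import tempfile
import zipfile

def archive_job(keys, fetch, store, notify):
    """One Gearman-style job: pull the originals, zip them, push the
    archive back, then tell the waiting client where it lives.

    `fetch(key, path)`, `store(path) -> url` and `notify(url)` are
    placeholders for the real S3 and Pusher calls.
    """
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "archive.zip")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for key in keys:
                local = os.path.join(tmp, os.path.basename(key))
                fetch(key, local)      # S3 -> temp dir on the file server
                zf.write(local, arcname=os.path.basename(key))
        url = store(zip_path)          # temp dir -> "archives" bucket
    notify(url)                        # client stops waiting
    return url
```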
I think you have pretty much described our original ('ghetto') solution with caching ('lipstick').
With streams, there is no need to cache, as recreating the download is dirt cheap: essentially just a few extra header bytes to pad the zip container, on top of the image content bytes that you always have to send.
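It's easy to put a number on that with a quick stdlib experiment (illustrative only; stored, uncompressed entries so that only the container cost shows up):

```python
import io
import zipfile

payload = b"x" * 100_000              # stands in for one image's bytes
names = ["img1.jpg", "img2.jpg", "img3.jpg"]

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for name in names:
        zf.writestr(name, payload)

# container cost = zip size minus raw content size:
# a few hundred bytes total here, roughly 100 per entry,
# regardless of how big each entry is
overhead = len(buf.getvalue()) - len(payload) * len(names)
print(overhead)
```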
The use case you mentioned, of sharing the download link, works exactly the same. You send the link, and whoever clicks it gets an instant download.
True, you are buffering data through your app instead of letting S3 take care of it. But if you're on AWS, S3 to EC2 is free and fast (200 MB/s+), and bandwidth out of EC2 costs the same as out of S3. If it goes over an Elastic IP, then a cent more per GB. Your app servers also handle some load, but Node.js (or any other evented framework) lives to multiplex IO, with only a few objects' worth of overhead per connection.
In return, you can delete a whole load of cache and job control code. Less code to write, test and maintain.
> With streams, there is no need to cache, as recreating the download is dirt cheap.
The cost when streaming and when not streaming should be pretty much the same, unless your non-streaming case works on disk (in which case you're comparing two very different things, and the comparison is anything but fair).