It is not possible to combine an anonymous PUT to S3 with a private GET. It turns out the anonymous user owns the object, so any anonymous user (which is everyone) can read the file. That’s bad. If you figure out a way to do anonymous PUTs with private GETs, please leave a comment.
I was able to figure out a slightly different mechanism that works almost as well as a PUT/GET. I don’t want to do an authenticated PUT for two reasons. First, I don’t want to hit the application server for every file piece (I am uploading in multiple chunks because I believe the browser will time out the HTTP connection, though I still need to figure out the timeout limits). Second, I don’t want to include the S3 secret keys in the client code. Neither option appeals as a solution for this use case.
For background, I am creating a file upload utility for a client. They want to let their clients upload 700MB – 1.4GB data files, with an eye towards even larger files. I want to make sure the client has the best experience possible. I realize large files can be a burden for some clients given the current state of upload bandwidth for most people. However, upload speeds are rapidly changing. At 2Mbps a 1.4GB file takes about 1 hour 40 minutes. On my personal puny link of 256Kbps the upload takes about 13 hours (here is a handy calculator). So that receiving bandwidth is not an issue and multiple clients can upload concurrently, the upload is being outsourced to S3 for now.
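For anyone who wants to check the arithmetic, here is a small sketch of the same calculation. It assumes 1GB means 2^30 bytes, that link speeds are decimal (2Mbps = 2,000,000 bits per second), and that there is no protocol overhead, so real uploads will be somewhat slower. The function names are my own.

```python
GIB = 1024 ** 3  # assumption: "GB" is counted as 2**30 bytes


def upload_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and retries."""
    return size_bytes * 8 / link_bits_per_second


def format_duration(seconds: float) -> str:
    """Render a duration as hours and minutes, rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} min"


if __name__ == "__main__":
    size = 1.4 * GIB
    for label, rate in (("2 Mbps", 2_000_000), ("256 Kbps", 256_000)):
        print(label, format_duration(upload_seconds(size, rate)))
```

With these assumptions the two lines come out at roughly 1 hour 40 minutes and just over 13 hours, which matches the figures above.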
Anyway, I have to create a mechanism to get the files to S3. There are lots of moving parts in the application, and one of the trickiest was the actual upload connection. I tried several techniques. It turns out Amazon has a form POST protocol that can be utilized. The important part is to create a signed “policy”, in S3 parlance. The policy contains some specifics about the upload and an expiration.
Once I realized what to do, it took a little playing around to create the correct HTML form post (I’m actually mimicking it, but it looks the same on the receiving end). There is also a great utility to help create the policy, which was a big time saver for the prototype. Check it out on the Amazon site: Policy Creator. It’s not pretty, but it works really well.
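To make the policy idea concrete, here is a rough sketch of how such a policy could be built and signed on the server. It assumes the older HMAC-SHA1 signing scheme for browser POST uploads (newer AWS documentation describes Signature Version 4, which uses different form fields). The bucket name, key prefix, size limit, and function name are all made up for illustration, so check the Amazon docs before relying on any of it.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone


def build_signed_policy(secret_key, bucket, key_prefix, max_bytes,
                        valid_for, now=None):
    """Return (base64 policy, base64 signature) for an S3 form POST.

    Assumes the legacy HMAC-SHA1 scheme; all names are illustrative.
    """
    now = now or datetime.now(timezone.utc)
    expiration = (now + valid_for).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    policy = {
        "expiration": expiration,
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],   # restrict object names
            {"acl": "private"},                     # uploads stay private
            ["content-length-range", 0, max_bytes],
        ],
    }
    # The base64 text of the policy is what gets signed and sent in the form.
    policy_b64 = base64.b64encode(
        json.dumps(policy).encode("utf-8")).decode("ascii")
    digest = hmac.new(secret_key.encode("utf-8"),
                      policy_b64.encode("ascii"), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return policy_b64, signature


if __name__ == "__main__":
    policy, sig = build_signed_policy(
        "EXAMPLE-SECRET", "example-bucket", "uploads/",
        2 * 1024 ** 3, timedelta(hours=1))
    print(policy)
    print(sig)
```

The two values go into the form's `policy` and `signature` fields next to the access key ID, so the secret key itself never has to leave the server.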
The docs for the Post Protocol are in the Amazon Documentation section. Check them out and read carefully.
Chris….
While GAE might have great infrastructure for web applications (I’m using the Java version), storing large amounts of data is pushing the envelope of what it can do. To store large files, I had to break each file up into chunks of a little less than 1MB (1MB is the largest unit of storage for a GAE object). When I started to retrieve the 1MB objects, I would periodically run into GAE datastore timeout issues.
Retrieving large entities does not seem to be worked out yet. As a note, I would get timeout errors retrieving a single 1MB entity by its primary key. After fiddling around a bit, I have decided that I’m pushing the limits of the GAE infrastructure. So, I’m back to trying out Amazon’s S3. Now that I have a prototype of the application, it shouldn’t be too hard to port to S3.
In fact, I’ve been able to set up an S3 look-alike instance on my local hardware using Eucalyptus, but that’s a story for another day.
At least for the time being I am going to try to create my application using AppEngine. They seem to bill based on resources actually consumed, whereas Amazon would need a CPU running 24×7. I know I would technically be consuming a resource by running the instance, but it would not be doing anything for most of that time. It would only need to be on for the short periods when a user visits the site (one can dream about actually using an instance 24×7).
The next hurdle to overcome is Google’s 1MB limit on things stored in their DB. We are uploading large files (700MB – 1.4GB). I am going to try to break things into 1MB chunks and see how that goes.
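As a rough illustration of the chunking idea (not the actual application code), a file can be read in fixed-size pieces that each stay under the limit. The 1,000,000-byte chunk size is my assumption, picked to leave some headroom below 1MB for per-entity overhead, and the function name is hypothetical.

```python
import io
from typing import BinaryIO, Iterator, Tuple

# Assumption: stay a little under 1MB (2**20 bytes) to leave room for overhead.
CHUNK_SIZE = 1_000_000


def iter_chunks(stream: BinaryIO,
                chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, data) pairs, each at most chunk_size bytes long."""
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:  # an empty read means end of stream
            break
        yield index, data
        index += 1


if __name__ == "__main__":
    payload = io.BytesIO(b"x" * 2_500_000)
    for index, data in iter_chunks(payload):
        # Each piece would be stored as its own entity, keyed by its index.
        print(index, len(data))
```

Keeping the index with each piece means the chunks can be stored and fetched independently and then joined back in order on download.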
For a client, I need to write a new application that will be used to a) create and manage user accounts and b) upload very large files (CD/DVD images). So we want to use a cloud service that has large bandwidth and file capacity. In essence, we want the file upload process to be as quick as possible for the user. Hosting our own bandwidth (with simultaneous connections) would be prohibitively expensive, and the client has no existing infrastructure for an application of this nature.
I am exploring using either Amazon EC2/S3 or Google AppEngine. I’m drawn, at least initially, to GAE since it is free to develop on. I’m hoping to architect the application so that I can easily port it to EC2/S3 if we desire. Some things to evaluate:
- what data store access method to use
- what upload code is required
For the second requirement, we are going to require Google Gears so that we can have background / restartable uploads. It’s always fun learning new things. Any experiences people care to share would be appreciated.