Saturday, July 16, 2016

AWS: Syncing a Local Directory to an S3 Storage Bucket


The S3 PUT operation only supports uploading one object per HTTP request.

This can be problematic when thousands (or even millions) of files need to be pushed to (or synced with) an S3 bucket.
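
To make the scale concrete, here is a minimal sketch in Python (standard library only) that walks a local directory and lists the object key each file would map to. Every pair it returns stands for one separate PUT request. The function name `plan_uploads` and the optional key `prefix` are assumptions made up for this illustration; they are not part of s3cmd or the S3 API.

```python
import os


def plan_uploads(local_dir, prefix=""):
    """Return a sorted list of (local_path, object_key) pairs, one per file.

    Each pair corresponds to one PUT request, because a PUT carries a
    single object. `prefix` is a hypothetical key prefix inside the bucket.
    """
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # Object keys use forward slashes regardless of the local OS.
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            pairs.append((path, prefix + rel))
    return sorted(pairs)
```

Calling `len(plan_uploads("/path/to/files"))` gives the number of individual requests a naive upload loop would have to make.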

Install s3cmd from GitHub:
$ git clone https://github.com/s3tools/s3cmd.git
$ cd s3cmd
$ python setup.py install clean

A user with credentials and the appropriate access policy must exist.

s3cmd will require an Access Key ID and Secret Access Key from AWS (see references below).

A policy needs to be attached to the user you create in this step.

I chose to grant administrator access, although a narrower policy that only allows S3 reads and writes on the target bucket would be the safer choice.

[Screenshot: Configuring AWS Credentials (private information redacted)]
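
As a rough idea of what a finer-grained policy could look like, the sketch below builds an IAM policy document limited to one bucket and prints it as JSON for pasting into the IAM console. This is an assumption about what a sync needs, not something taken from the s3cmd documentation: the helper name `bucket_sync_policy` is hypothetical, and s3cmd may require further actions (for example `s3:GetBucketLocation`) depending on how it is used.

```python
import json


def bucket_sync_policy(bucket):
    """Build a policy dict that allows listing one bucket and
    reading, writing, and deleting the objects inside it.

    Assumption: these actions are enough for a basic sync; adjust
    the list if s3cmd reports a 403 for some operation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object operations apply to the keys under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def policy_json(bucket):
    """Return the policy as indented JSON text."""
    return json.dumps(bucket_sync_policy(bucket), indent=2)
```

For example, `print(policy_json("my-bucket"))` prints a document scoped to a bucket named `my-bucket` (a placeholder name).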

Once installed, s3cmd needs to be configured.

Assuming you have already created a bucket in S3, s3cmd is configured like this:
$ ./s3cmd --configure s3://<bucket>/

You will be asked for your access key and your secret key.

I accepted the default for region [US], and left the encryption password and path to GPG empty.

I selected No for use of HTTPS. Finally, I left the proxy server option empty as well.

If the configuration gives a 403 error, re-check your credentials and access policy (above).

A successful configuration attempt looks like this (private information redacted):
~/workspaces/public/vagrant/s3cmd $ ./s3cmd --configure s3://***/

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
Access Key: ***
Secret Key: ***
Default Region [US]: 

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password: 
Path to GPG program [/usr/local/bin/gpg]: 

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
Use HTTPS protocol [No]: 

On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name: 

New settings:
  Access Key: ***
  Secret Key: ***
  Default Region: US
  Encryption password: 
  Path to GPG program: 
  Use HTTPS protocol: False
  HTTP Proxy server name: 
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] Y
Please wait, attempting to list bucket: s3://***/
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Not configured. Never mind.

Save settings? [y/N] y
Configuration saved to '/Users/craigtrim/.s3cfg'
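
To double-check what was saved without echoing the keys back to the terminal, a small sketch like the following can read the file and mask the sensitive values. It assumes the saved file is INI-style with a `[default]` section and option names such as `access_key`, `secret_key` and `use_https`; the function name `summarize_s3cfg` and the set of masked options are choices made for this illustration.

```python
import configparser

# Options whose values should never be printed (an assumed list).
SECRET_OPTIONS = {"access_key", "secret_key", "gpg_passphrase"}


def summarize_s3cfg(path):
    """Return the options of the [default] section with secrets masked."""
    # Interpolation is disabled because some values may contain
    # %(...)s placeholders that configparser would try to expand.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path) as handle:
        parser.read_file(handle)
    summary = {}
    for option, value in parser["default"].items():
        if option in SECRET_OPTIONS and value:
            summary[option] = "***"
        else:
            summary[option] = value
    return summary
```

For example, `summarize_s3cfg(os.path.expanduser("~/.s3cfg"))` (after `import os`) should show `use_https` as `False` for the configuration above, assuming the file layout described here.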

Usage is straightforward:
$ ./s3cmd sync ~/Documents/files/ s3://<bucket>/
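
For scripted use, the same command can be assembled and launched from Python. The sketch below defaults to s3cmd's `--dry-run` option, so the first run only reports what would be transferred. The helper names `build_sync_command` and `run_sync` are hypothetical, the `./s3cmd` path assumes you are running from the cloned repository as above, and the trailing slash on the source follows the command shown here (sync the directory's contents rather than the directory itself).

```python
import os
import subprocess


def build_sync_command(local_dir, bucket, dry_run=True, s3cmd="./s3cmd"):
    """Return the argument list for an `s3cmd sync` invocation."""
    # subprocess does not go through a shell, so expand "~" here.
    source = os.path.expanduser(local_dir).rstrip("/") + "/"
    command = [s3cmd, "sync"]
    if dry_run:
        # Only list what would be uploaded; transfer nothing.
        command.append("--dry-run")
    command += [source, f"s3://{bucket}/"]
    return command


def run_sync(local_dir, bucket, dry_run=True, s3cmd="./s3cmd"):
    """Run the sync and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_sync_command(local_dir, bucket, dry_run, s3cmd), check=True
    )
```

A cautious workflow is `run_sync("~/Documents/files", "<bucket>")` first, then the same call with `dry_run=False` once the listed transfers look right.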


References

  1. [Amazon] AWS Credentials
  2. [StackOverflow] Batch Uploads to S3
  3. [s3Cmd] S3 Sync Howto
