Implementing S3 Delta-Upload for static-aws-deploy

About two months ago I wrote this post describing how and why I built static-aws-deploy, which I’ve been using to update my blog since then.

So far the tool has worked well, but I recently noticed that the number of files to upload grows steadily with every post, so I decided to tackle delta-uploading before it gets out of hand.

A delta-upload is just a fancy way of saying: only upload files which are new or have changed. So instead of updating all the files on my blog every time I publish a post, I only want to upload the new files and the ones which actually changed (e.g. the RSS feed). This saves a bit of bandwidth and a lot of requests to the API, and it makes the tool more convenient for small changes like fixing typos after publishing.

Let’s take a look at what I came up with!

Approach

There are several ways of finding out which files are new or changed. In this case, I decided to use the lastModifiedDate of the files as well as their md5 hash. A file will be marked as changed if the lastModifiedDate of the local file is after that of the already-uploaded one AND the md5 hashes don't match; a file which doesn't exist in the bucket at all is new and always gets uploaded. This means a file which was merely touched locally (newer timestamp, but identical content and therefore an identical hash) won't be re-uploaded.

The reason why I used md5 is simply that the S3 REST API has a way of listing all files within a bucket, which is documented here, and each listed file comes with an ETag property - the md5 hash of the file's contents (at least for objects uploaded in a single part, as is the case here).
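The response is a ListBucketResult XML document which, abbreviated and with made-up values, looks roughly like this - note that the ETag value itself comes wrapped in quotes:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Name>my-bucket</Name>
    <KeyCount>2</KeyCount>
    <Contents>
        <Key>index.xml</Key>
        <LastModified>2016-11-01T12:00:00.000Z</LastModified>
        <ETag>"9b2cf535f27731c974343645a3985328"</ETag>
        <Size>4096</Size>
    </Contents>
    ...
</ListBucketResult>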

From an architectural perspective, static-aws-deploy already walks all the files in order to collect them and generate their Headers for uploading, so hooking in there seemed like a good idea.

Implementation

First, I needed to add a flag for enabling delta-uploading, to keep the tool backwards compatible:

var (
    ...
    delta      bool
)

flag.BoolVar(&delta, "delta", false, "only upload changed files")
flag.BoolVar(&delta, "d", false, "only upload changed files (shorthand)")

The next step was to pass this flag to the upload.ParseFiles function.

files, err := upload.ParseFiles(&config.S3, delta)
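The signature of ParseFiles then simply gains the flag as a second parameter - roughly like this (a sketch; the return type follows from how result is used further down):

func ParseFiles(config *Config, delta bool) (map[string][]Header, error) {
    ...
}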

In this function, if the delta flag is set, a request needs to be made to AWS S3 to fetch info about all the files within the bucket and put it into a nice data structure for later use. The data structure looks like this:

type Delta map[string]*DeltaProperties

type DeltaProperties struct {
    LastModified time.Time
    ETag         string
}

Basically, it just maps from a file's key to the properties we need to determine whether it has changed or not.
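For example, a populated map might contain an entry like this (values made up):

deltaMap := Delta{
    "index.xml": &DeltaProperties{
        LastModified: time.Date(2016, 11, 1, 12, 0, 0, 0, time.UTC),
        ETag:         "9b2cf535f27731c974343645a3985328",
    },
}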

In order to create and populate this map, I wrote the getDeltaMap function, which makes a request to AWS S3, parses the XML with etree, handles all errors and returns a Delta. It looks something like this:

func getDeltaMap(config *Config) (Delta, error) {
    // make an authenticated AWS S3 list request
    client := &http.Client{}
    req, err := http.NewRequest("GET", fmt.Sprintf("https://s3.amazonaws.com/%s/?list-type=2", config.Bucket.Name), nil)
    if err != nil {
        return nil, err
    }
    awsauth.Sign(req, awsauth.Credentials{
        AccessKeyID:     config.Bucket.Accesskey,
        SecretAccessKey: config.Bucket.Key,
    })
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // parse the XML response
    doc := etree.NewDocument()
    if _, err := doc.ReadFrom(resp.Body); err != nil {
        return nil, err
    }
    root := doc.SelectElement("ListBucketResult")
    if root == nil {
        return nil, fmt.Errorf("unexpected S3 response: no ListBucketResult")
    }
    contents := root.SelectElements("Contents")

    // build the map from object key to its properties
    deltaMap := make(Delta)
    for _, file := range contents {
        lastModified := file.SelectElement("LastModified")
        etag := file.SelectElement("ETag")
        key := file.SelectElement("Key")
        parsedLastModified, err := time.Parse(time.RFC3339Nano, lastModified.Text())
        if err != nil {
            return nil, err
        }
        deltaProp := DeltaProperties{
            // S3 wraps the ETag in quotes, so trim them off
            ETag:         strings.Trim(etag.Text(), "\""),
            LastModified: parsedLastModified,
        }
        deltaMap[key.Text()] = &deltaProp
    }
    return deltaMap, nil
}

Nothing too surprising here, except that we need to Trim the surrounding quotes from the ETag, because that's how S3 returns it in the XML. (Note that this simple version reads only a single page of results - S3 paginates the listing after 1000 keys - which is fine for a small blog.) With this in place, we already have the key component for our delta-upload: a way to check whether the files we are planning to upload differ from the ones already on the server.

The next step was to find out the lastModifiedDate and md5 hash of our local files. For this purpose, I created the hasFileChanged function as follows:

func hasFileChanged(info os.FileInfo, deltaMap Delta, uploadPath, filePath string) (bool, error) {
    etag, err := calculateETag(filePath)
    if err != nil {
        return false, err
    }
    deltaProps := deltaMap[uploadPath]
    if deltaProps != nil {
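        // already uploaded: re-upload only if the content differs and the local copy is newer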
        return etag != deltaProps.ETag && info.ModTime().After(deltaProps.LastModified), nil
    }
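    // the file doesn't exist in the bucket yet, so it's new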
    return true, nil
}

This function will be called while walking all the files to be uploaded. It gets an os.FileInfo, the deltaMap and two paths: the normal, local filepath, which we need to calculate the md5 hash, and the cleaned upload path, which matches the Key of the already-uploaded files mapped within deltaMap.

First, the ETag of the local file is calculated:

func calculateETag(path string) (string, error) {
    data, err := ioutil.ReadFile(path)
    if err != nil {
        return "", fmt.Errorf("could not read file: %s while calculating its ETag, %v", path, err)
    }
    return fmt.Sprintf("%x", md5.Sum(data)), nil
}
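Reading the whole file into memory is perfectly fine for the small files of a blog. If the files were bigger, one could stream them through the hash instead - a hypothetical variant (not in the tool) might look like this:

import (
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

// calculateETagStreaming hashes the file without loading it
// into memory all at once
func calculateETagStreaming(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}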

And then, for each file which exists within deltaMap, we check whether the ETags match and whether the file was modified after it was uploaded; files which don't appear in deltaMap at all are new and are always marked as changed. This way we catch both changed and new files.

Armed with these functions, it was trivial to extend the file-walking logic:

// in ParseFiles(...)
...
deltaMap := make(Delta)
if delta {
    deltaMap, err = getDeltaMap(config)
    if err != nil {
        return nil, err
    }
}
err = filepath.Walk(source, func(path string, info os.FileInfo, err error) error {
    // propagate errors from walking, otherwise info might be nil
    if err != nil {
        return err
    }
    if !info.IsDir() && !re.MatchString(path) {
        hasChanged := true
        if delta {
            hasChanged, err = hasFileChanged(info, deltaMap, getUploadPath(config, path), path)
            if err != nil {
                return err
            }
        }
        if hasChanged {
            result[path] = []Header{}
        }
    }
    return nil
})

Alright, that’s it. Now running static-aws-deploy --dr -d on my blog source reveals the following:

0 Files to upload (4 concurrently)...
Delta Dry Run finished.

And after changing the RSS-feed (index.xml) manually:

1 Files to upload (4 concurrently)...
www/index.xml...Done.
Delta Dry Run finished.

Done!

You can check out the full diff of the commit here.

Conclusion

A fun little feature for a useful little tool - nice!

This post has already been successfully uploaded using this mechanism, which will hopefully save time, bandwidth and requests for many posts to come! ;)

I work as a freelance software engineer and trainer, and you can hire me. If you like my work or my writing, and I sound like someone you'd like to work with, let's talk! :)