Loading Data from S3 to FSx and Using S3 as Backup

This section describes how to load data from Amazon S3 into an FSx for Lustre file system using Data Repository Associations (DRA), and presents a recommended strategy for using S3 as durable storage to back up data when using FSx scratch file systems.

Background and Design Philosophy

FSx for Lustre scratch file systems provide very high I/O performance at low cost, but they are not durable storage:

  • Scratch FSx does not replicate data

  • Scratch FSx does not support backups

  • Data loss is possible in the event of service or hardware failure

Therefore:

  • FSx scratch must never be treated as the only copy of important data

  • Amazon S3 should be used as the authoritative source and/or backup

The recommended design is:

  • Input data: stored durably in S3, imported into FSx for fast access

  • Output data: written to FSx during computation and exported back to S3
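
The export half of this design can be sketched with the AWS CLI as follows. This is a minimal sketch, not a definitive implementation: export_to_s3 is a hypothetical helper name, the file system ID is a placeholder, and the path is assumed to be given relative to the file system root. The --report Enabled=false option skips the optional completion report.

```shell
# Hypothetical helper: start an export of one FSx directory to its linked S3 prefix.
# $1 = file system ID (placeholder), $2 = path relative to the FSx root, e.g. "output"
export_to_s3() {
  aws fsx create-data-repository-task \
    --type EXPORT_TO_REPOSITORY \
    --file-system-id "$1" \
    --paths "$2" \
    --report Enabled=false
}
```

For example, export_to_s3 fs-xxxxxxxx output would request an export of the output directory once a computation has finished writing to it.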

Data Repository Associations (DRA)

A Data Repository Association (DRA) links a directory inside FSx to an S3 bucket or prefix.

For example:

  • FSx path /ExtData → s3://my-bucket/ExtData/

  • FSx path /output → s3://my-bucket/output/

After mounting FSx locally at /fsx_input, these appear as:

/fsx_input/ExtData
/fsx_input/output

A file system may have multiple DRAs, provided their FSx paths do not overlap.
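
A link like the ones above might be created with the AWS CLI roughly as follows. This is a sketch under assumptions: create_dra is a hypothetical helper name, the arguments are placeholders, and the auto-import policy shown (react to new, changed, and deleted objects) is one illustrative choice rather than a required setting.

```shell
# Hypothetical helper: link one FSx directory to one S3 prefix.
# $1 = file system ID (placeholder)
# $2 = FSx path, e.g. /ExtData
# $3 = S3 URI, e.g. s3://my-bucket/ExtData/
create_dra() {
  # The auto-import policy below is an assumed, illustrative choice.
  aws fsx create-data-repository-association \
    --file-system-id "$1" \
    --file-system-path "$2" \
    --data-repository-path "$3" \
    --s3 'AutoImportPolicy={Events=[NEW,CHANGED,DELETED]}'
}
```

For example, create_dra fs-xxxxxxxx /ExtData s3://my-bucket/ExtData/ would correspond to the first mapping listed above.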

How Data Is Loaded from S3 to FSx

It is important to understand that creating a DRA does not immediately copy all file contents from S3.

By default:

  • FSx does not proactively import metadata or file contents

  • Metadata and data are imported lazily when files are accessed

  • This behavior saves time and storage for large datasets

There are three ways to populate data from S3 to FSx.

Method 1: On-Demand Loading (Default)

If auto-import policies are enabled on the DRA:

  • Files and directories appear immediately under FSx

  • File contents are fetched automatically when first accessed

Example:

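# Listing shows the entries; file contents are not fetched yet
ls /fsx_input/ExtData
# The first read fetches the file contents from S3
cat /fsx_input/ExtData/example.nc > /dev/null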

This method is sufficient for most workflows and requires no manual action.
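
Whether a given file's contents have already been fetched can be checked from a client with the Lustre tools installed. The sketch below assumes that lfs hsm_state reports files whose contents are not yet on FSx with the word "released" in its output; is_released is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if the file's contents are NOT yet on FSx.
# Assumes `lfs hsm_state` marks such files with the word "released".
# Note: a path that itself contains "released" would also match.
is_released() {
  lfs hsm_state "$1" | grep -q 'released'
}
```

For example, is_released /fsx_input/ExtData/example.nc is expected to succeed before the file is first read and to fail afterwards.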

Method 3: Full Data Import (Required Before Deleting S3)

If you intend to delete the associated S3 bucket or make FSx fully self-contained, you must explicitly import all data.

Example:

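aws fsx create-data-repository-task \
  --type IMPORT_METADATA_FROM_REPOSITORY \
  --file-system-id fs-xxxxxxxx \
  --paths s3://my-bucket/ExtData/ \
  --report Enabled=false

For import tasks, --paths takes S3 URIs, and --report is required (Enabled=false skips the completion report). This task imports file and directory metadata from S3 into FSx; it does not copy file contents. To make FSx self-contained, the contents must also be loaded, either by reading every file or by preloading them with lfs hsm_restore.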
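
File contents can also be requested explicitly from a client that has the file system mounted. The sketch below assumes the Lustre client's lfs hsm_restore command is available and that the caller has sufficient privileges (this may mean running as root); preload_dir is a hypothetical helper name. The restore is requested asynchronously, so completion should be confirmed separately rather than inferred from the command returning.

```shell
# Hypothetical helper: ask FSx to load the contents of every file under a directory.
# $1 = directory on the mounted file system, e.g. /fsx_input/ExtData
preload_dir() {
  # One hsm_restore request per regular file found under the directory.
  find "$1" -type f -exec lfs hsm_restore {} \;
}
```

For example, preload_dir /fsx_input/ExtData would request every file under the ExtData directory.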

Warning

Do not delete the S3 bucket until every import task has succeeded and all file contents have been loaded into FSx. An available DRA does not by itself guarantee that any data has been copied.

Monitor progress with:

aws fsx describe-data-repository-tasks

Proceed only when the task's Lifecycle is SUCCEEDED.
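
Polling can be automated along these lines. This is a sketch under assumptions: wait_for_task and POLL_SECONDS are hypothetical names, and the task ID is whatever create-data-repository-task returned. The CLI reports a finished task with a Lifecycle of SUCCEEDED, and FAILED or CANCELED when it did not finish.

```shell
# Hypothetical helper: poll one data repository task until it finishes.
# $1 = task ID; POLL_SECONDS (assumed name) sets the delay between polls.
# Returns 0 on SUCCEEDED, 1 on FAILED or CANCELED, 2 if the CLI call itself fails.
wait_for_task() {
  while :; do
    state=$(aws fsx describe-data-repository-tasks \
      --task-ids "$1" \
      --query 'DataRepositoryTasks[0].Lifecycle' \
      --output text) || return 2
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|CANCELED) echo "task $1 ended as $state" >&2; return 1 ;;
    esac
    sleep "${POLL_SECONDS:-30}"
  done
}
```

A caller could then gate later steps on the result, for example by running wait_for_task with the task ID and stopping if it returns a non-zero status.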

FSx Storage Capacity Considerations

When performing a full import:

  • FSx storage capacity must be at least the total logical size of the data in S3; adding some headroom is recommended

  • Do not rely on compression to reduce required capacity

  • Imports will fail if FSx runs out of space
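
The capacity rule above amounts to a simple comparison, sketched below. The helper name fits_with_headroom and the 10 percent default are assumptions for illustration only. The byte counts would have to be gathered separately, for instance from the "Total Size" line of aws s3 ls with --recursive --summarize and from lfs df on the mounted file system.

```shell
# Hypothetical helper: exit 0 if the data fits on FSx with some headroom.
# $1 = total bytes in S3
# $2 = usable bytes on FSx
# $3 = headroom in percent (assumed default: 10)
fits_with_headroom() {
  needed=$(( $1 + $1 * ${3:-10} / 100 ))
  [ "$needed" -le "$2" ]
}
```

For example, fits_with_headroom 1000 1100 10 succeeds, because 1000 bytes plus 10 percent headroom is exactly 1100 bytes, while the same call with 1099 usable bytes fails.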

When Is It Safe to Delete the S3 Bucket?

It is safe to delete the S3 bucket only if all of the following are true:

  • All import tasks have succeeded and every file's contents have been loaded into FSx

  • FSx storage usage is stable (verified with lfs df)

  • Files remain readable after the client page cache is dropped

  • You understand that FSx scratch provides no recovery or backups

In most cases, deleting the S3 bucket is not recommended. Keeping S3 as the durable source of truth is safer and usually cheaper.
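
If deletion is still being considered, one additional check is to count how many files lack local contents. The sketch below assumes that lfs hsm_state marks such files with the word "released"; count_released is a hypothetical helper name, and a result of 0 is necessary but not sufficient evidence that FSx holds everything.

```shell
# Hypothetical helper: print how many files under a directory still lack local contents.
# Assumes `lfs hsm_state` marks such files with the word "released".
# Note: paths that themselves contain "released" would be counted too.
count_released() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
  find "$1" -type f -exec lfs hsm_state {} \; | grep -c 'released' || true
}
```

For example, count_released /fsx_input/ExtData prints the number of files under ExtData that would still need to be fetched from S3.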

Summary

  • FSx scratch provides fast but non-durable storage

  • A DRA does not automatically copy all file contents

  • Import metadata and preload file contents explicitly to fully populate FSx if needed

  • Always use S3 as the durable source and/or backup

  • Never treat FSx scratch as the only copy of important data