Data transfer between FSx and S3 bucket

This section describes two approaches for transferring data between an Amazon FSx for Lustre file system and an Amazon S3 bucket.

Launch an EC2 instance for data transfer

In this approach, data transfer is performed through a dedicated EC2 instance. This instance is used only for data movement, not for computation.

  • Launch EC2 instance

    • Launch a single EC2 instance (not a ParallelCluster)

    • The instance must be in the same VPC as the FSx file system

    • Mount the FSx file system on the instance

  • Data transfer commands

    • Transfer data from S3 to FSx: aws s3 sync s3://<bucket>/input /fsx/input

    • Transfer data from FSx to S3: aws s3 sync /fsx/output s3://<bucket>/output

  • Terminate instance

    • After data transfer is complete, terminate the EC2 instance to avoid unnecessary charges.

  • Instance type recommendation: a compute-optimized instance type (e.g., c6i.large) is recommended for data transfer tasks.

  • Python tools for data transfer: as an alternative to the AWS CLI, a Python package such as boto3 can be used on the instance to download data from the S3 bucket.
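
A minimal sketch of the boto3 route, assuming the script runs on the transfer instance with AWS credentials available. The function names and the input/ prefix are illustrative, and <bucket> is the same placeholder used in the sync commands above.

```python
import os


def local_path_for_key(key, prefix, dest_root):
    """Map an S3 object key under `prefix` to a file path under `dest_root`."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(dest_root, *relative.split("/"))


def download_prefix(bucket, prefix, dest_root):
    """Download every object under s3://bucket/prefix into dest_root."""
    import boto3  # imported lazily so the path helper works without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip zero-byte "folder" placeholder objects
                continue
            target = local_path_for_key(key, prefix, dest_root)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, key, target)


# Example call (needs AWS credentials and a mounted /fsx):
# download_prefix("<bucket>", "input/", "/fsx/input")
```

Unlike aws s3 sync, this sketch downloads every object each time it runs; it does not skip files that are already up to date.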

Data repository association (DRA)

FSx for Lustre Data Repository Associations (DRA) provide significantly higher performance than aws s3 sync for transferring data between FSx and S3.

With DRA, data transfer is handled natively by the AWS service rather than through an EC2 instance.

Requirements

  • The FSx file system and the S3 bucket must belong to the same AWS account (the same account ID; the IAM user does not matter)

  • The FSx file system and the S3 bucket must be in the same region

  • The permissions on the S3 bucket to be linked must allow access from FSx

  • If any of these requirements is not met, the DRA cannot be created or data cannot be loaded through it.
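
The region requirement can be checked before creating a DRA. The sketch below covers only that one requirement, and its function names are illustrative. It relies on the S3 GetBucketLocation call, which reports no location constraint for buckets in us-east-1 and the legacy value EU for some older eu-west-1 buckets.

```python
def normalize_bucket_region(location_constraint):
    """Translate the LocationConstraint returned by S3 into a region name."""
    if location_constraint is None:
        return "us-east-1"  # S3 reports no constraint for us-east-1
    if location_constraint == "EU":
        return "eu-west-1"  # legacy value for older eu-west-1 buckets
    return location_constraint


def bucket_region(bucket):
    """Look up the region of `bucket` (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    response = s3.get_bucket_location(Bucket=bucket)
    return normalize_bucket_region(response.get("LocationConstraint"))


def same_region(bucket, fsx_region):
    """True if the bucket is in the region where the FSx file system lives."""
    return bucket_region(bucket) == fsx_region
```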

Create DRA association

Through console

When creating an FSx file system in the console, specify the association on the Data repository import/export (DRA) tab.

Assume an FSx for Lustre file system is mounted on an EC2 instance at:

/fsx_input

Three data repository associations (DRAs) are created with the following settings:

  • DRA 1

    • File system path: /ExtData

    • S3 path: s3://dzhang-imi-gchp-test/ExtData

  • DRA 2

    • File system path: /blended-tropomi

    • S3 path: s3://dzhang-imi-gchp-test/blended-tropomi

  • DRA 3

    • File system path: /blended-boundary-conditions

    • S3 path: s3://dzhang-imi-gchp-test/blended-boundary-conditions

Note

  • Select all import policies and deselect all export policies so that S3 → FSx synchronization is enabled while FSx → S3 synchronization is disabled.

  • The AWS console cannot create multiple DRAs at once. DRAs can be added or modified after the FSx file system has been created:

    • Go to AWS Console → FSx

    • Select your FSx for Lustre file system

    • Open the Data repository tab

    • Click Create data repository association

In this case, the linked S3 data will appear locally on the EC2 instance as:

/fsx_input/ExtData/
/fsx_input/blended-tropomi/
/fsx_input/blended-boundary-conditions/

The local mount point (/fsx_input) corresponds to the root of the FSx file system. Each data repository association creates a directory directly under this root, with contents mirrored from the associated S3 prefix.
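
The mapping from an S3 object to its local path can be sketched as a small helper. The mount point and the three associations are the ones listed above; the function name and the object name HEMCO/file.nc in the usage note are hypothetical.

```python
import posixpath

MOUNT_POINT = "/fsx_input"

# File system path -> S3 path, as configured in the three DRAs above.
DRAS = {
    "/ExtData": "s3://dzhang-imi-gchp-test/ExtData",
    "/blended-tropomi": "s3://dzhang-imi-gchp-test/blended-tropomi",
    "/blended-boundary-conditions": "s3://dzhang-imi-gchp-test/blended-boundary-conditions",
}


def local_path_for_s3_uri(s3_uri, mount_point=MOUNT_POINT, dras=DRAS):
    """Return the local path at which `s3_uri` is exposed, or None if no DRA covers it."""
    for fs_path, repo_path in dras.items():
        repo_root = repo_path.rstrip("/")
        repo_prefix = repo_root + "/"
        if s3_uri == repo_root or s3_uri.startswith(repo_prefix):
            relative = s3_uri[len(repo_prefix):]
            return posixpath.join(mount_point, fs_path.lstrip("/"), relative)
    return None
```

For example, local_path_for_s3_uri("s3://dzhang-imi-gchp-test/ExtData/HEMCO/file.nc") returns "/fsx_input/ExtData/HEMCO/file.nc", while a URI outside the three linked prefixes returns None.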

Through CLI

Data repository associations (DRA) can be added during creation or afterwards using aws fsx create-data-repository-association.

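Note that aws fsx create-data-repository-association does not take a role ARN argument. As far as the AWS documentation describes it, FSx for Lustre reaches S3 through a service-linked IAM role that it creates for the file system, so the IAM identity running the command needs permission to create that role in addition to the required S3 actions.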

# Create three Data Repository Associations (DRAs).
# Note: DRA file system paths cannot overlap (e.g., /ExtData and /ExtData/subdir).
# DRAs are supported on FSx for Lustre 2.12/2.15 file systems (excluding scratch_1).

# DRA 1: Import-only (recommended for static input data like ExtData)
aws fsx create-data-repository-association \
  --file-system-id "$FSX_ID" \
  --file-system-path "/ExtData" \
  --data-repository-path "s3://dzhang-imi-gchp-test/ExtData" \
  --batch-import-meta-data-on-create \
  --s3 '{
    "AutoImportPolicy": {"Events": ["NEW","CHANGED","DELETED"]}
  }' \
  --tags Key=Name,Value=dra-extdata \
  --client-request-token dra-extdata-001

# DRA 2: Import-only (or add AutoExportPolicy if you truly want FSx -> S3 sync)
aws fsx create-data-repository-association \
  --file-system-id "$FSX_ID" \
  --file-system-path "/blended-tropomi" \
  --data-repository-path "s3://dzhang-imi-gchp-test/blended-tropomi" \
  --batch-import-meta-data-on-create \
  --s3 '{
    "AutoImportPolicy": {"Events": ["NEW","CHANGED","DELETED"]}
  }' \
  --tags Key=Name,Value=dra-blended-tropomi \
  --client-request-token dra-tropomi-001

# DRA 3: Import-only (or add AutoExportPolicy if you truly want FSx -> S3 sync)
aws fsx create-data-repository-association \
  --file-system-id "$FSX_ID" \
  --file-system-path "/blended-boundary-conditions" \
  --data-repository-path "s3://dzhang-imi-gchp-test/blended-boundary-conditions" \
  --batch-import-meta-data-on-create \
  --s3 '{
    "AutoImportPolicy": {"Events": ["NEW","CHANGED","DELETED"]}
  }' \
  --tags Key=Name,Value=dra-blended-bc \
  --client-request-token dra-bc-001
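
The same three associations can be created from Python. The sketch below is an assumed boto3 equivalent of the commands above, not a tested replacement for them. It builds the request parameters separately from the API call so that the non-overlap rule can be checked first. The function names are illustrative, and tags and client request tokens are omitted for brevity.

```python
def paths_overlap(a, b):
    """True if one file system path equals or is nested inside the other."""
    a = a.rstrip("/") + "/"
    b = b.rstrip("/") + "/"
    return a.startswith(b) or b.startswith(a)


def build_dra_requests(file_system_id, bucket, names):
    """Build keyword arguments for one import-only DRA per directory name."""
    fs_paths = ["/" + name for name in names]
    for i, first in enumerate(fs_paths):
        for second in fs_paths[i + 1:]:
            if paths_overlap(first, second):
                raise ValueError(f"overlapping DRA paths: {first} and {second}")
    return [
        {
            "FileSystemId": file_system_id,
            "FileSystemPath": fs_path,
            "DataRepositoryPath": f"s3://{bucket}{fs_path}",
            "BatchImportMetaDataOnCreate": True,
            # Import-only: no AutoExportPolicy is set.
            "S3": {"AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]}},
        }
        for fs_path in fs_paths
    ]


def create_dras(requests):
    """Send each request to FSx (needs AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    fsx = boto3.client("fsx")
    return [fsx.create_data_repository_association(**req) for req in requests]
```

A call such as build_dra_requests(fsx_id, "dzhang-imi-gchp-test", ["ExtData", "blended-tropomi", "blended-boundary-conditions"]) yields the three parameter sets, and passing ["ExtData", "ExtData/subdir"] raises ValueError before anything is sent to AWS.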

Verify DRA exists

aws fsx describe-data-repository-associations \
  --filters Name=file-system-id,Values="$FSX_ID" \
  --query "Associations[*].{Path:FileSystemPath,S3:DataRepositoryPath,State:Lifecycle}"
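
A DRA is usable only once its lifecycle state is AVAILABLE, so a script may want to check the state rather than read it by eye. The following sketch mirrors the query above in boto3. The function names are illustrative, and pagination (NextToken) is ignored on the assumption that a file system has only a handful of associations.

```python
def summarize_associations(response):
    """Reduce a describe-data-repository-associations response to path, S3 path and state."""
    return [
        {
            "Path": assoc.get("FileSystemPath"),
            "S3": assoc.get("DataRepositoryPath"),
            "State": assoc.get("Lifecycle"),
        }
        for assoc in response.get("Associations", [])
    ]


def all_available(response):
    """True if at least one DRA exists and every DRA is in the AVAILABLE state."""
    rows = summarize_associations(response)
    return bool(rows) and all(row["State"] == "AVAILABLE" for row in rows)


def describe_dras(file_system_id):
    """Fetch the DRAs of one file system (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    fsx = boto3.client("fsx")
    return fsx.describe_data_repository_associations(
        Filters=[{"Name": "file-system-id", "Values": [file_system_id]}]
    )
```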

Important

FSx for Lustre accesses S3 through an IAM service role trusted by fsx.amazonaws.com. When enabling DRA, ensure that the associated IAM role is allowed to perform s3:GetObject, s3:PutObject, and s3:ListBucket on the linked S3 bucket or prefix.
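
As an illustration of how the three actions named above map onto resources, the sketch below builds a policy document as a Python dictionary. It is not a complete policy for FSx, which may need further actions, and the function name is hypothetical. The point it shows is that s3:ListBucket is granted on the bucket ARN, while s3:GetObject and s3:PutObject are granted on object ARNs.

```python
import json


def s3_access_policy(bucket, prefix=""):
    """Build a policy document allowing list, read and write on a bucket or prefix."""
    prefix = prefix.strip("/")
    if prefix:
        object_resource = f"arn:aws:s3:::{bucket}/{prefix}/*"
    else:
        object_resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: the resource is the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions: the resource is the set of objects.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": object_resource,
            },
        ],
    }


# Render the policy for one of the linked prefixes as JSON text.
policy_json = json.dumps(s3_access_policy("dzhang-imi-gchp-test", "ExtData"), indent=2)
```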

Import and Export Semantics of DRA

A Data Repository Association (DRA) does not provide real-time or bidirectional synchronization between FSx and S3. Instead, it implements directional, policy-driven, and largely lazy data movement.

Understanding these semantics is critical when using FSx scratch file systems.

FSx → S3 (Export Semantics)

  • One-time export required for pre-existing data

    Files that already exist in FSx before the DRA is created are not exported automatically. A one-time export task is required to establish a baseline copy in S3.

  • Auto-export (after DRA creation)

    When auto-export is enabled:

    • Newly created or modified files in FSx are automatically exported to S3

    • The full file contents (not only metadata) are written to S3

    • Export occurs asynchronously but usually within minutes

  • Deletion behavior

    • Deleting a file in FSx does not delete the corresponding object in S3 unless the auto-export policy includes the DELETED event

    • S3 is treated as durable, append-oriented storage
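
The one-time export mentioned above corresponds to a data repository task of type EXPORT_TO_REPOSITORY. The following is a minimal boto3 sketch with illustrative function names. It assumes that the paths are given relative to the file system root and that each path is covered by a DRA, and it uses a hypothetical output directory that is not one of the DRAs configured in this section.

```python
def build_export_task(file_system_id, paths):
    """Build keyword arguments for a one-time FSx -> S3 export task."""
    return {
        "Type": "EXPORT_TO_REPOSITORY",
        "FileSystemId": file_system_id,
        "Paths": list(paths),  # assumed relative to the file system root
        "Report": {"Enabled": False},  # no completion report is written to S3
    }


def start_export(request):
    """Submit the task and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    fsx = boto3.client("fsx")
    response = fsx.create_data_repository_task(**request)
    return response["DataRepositoryTask"]["TaskId"]


# Example call; "output" is a hypothetical directory linked through a DRA:
# start_export(build_export_task("<file-system-id>", ["output"]))
```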

S3 → FSx (Import Semantics)

  • Fully lazy import model

    With auto-import enabled:

    • Files stored in S3 (created before or after the DRA) are not proactively copied to FSx

    • Both metadata and file contents are imported only when the file is accessed (e.g., ls, stat, file open, or model read)

  • Access-triggered behavior

    On first access to a given file from FSx:

    • File metadata is imported into the FSx namespace

    • File data blocks are downloaded on demand

    • Subsequent accesses to the same file reuse the cached data and do not require re-downloading, unless the cache is evicted or the file changes

  • Manual import tasks (optional)

    Manual import tasks may be used to pre-populate directory structure and metadata, but file contents are still fetched lazily unless a full import is explicitly requested.
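
When a job should not pay the first-access download cost, file contents can be pulled into FSx ahead of time. One way to do this, not named in this section and therefore an assumption here, is the Lustre command lfs hsm_restore run against each file. The sketch below builds batched commands for a directory tree and runs them. The function names and the batch size are illustrative, and the command may need root privileges on the instance.

```python
import os


def build_restore_commands(root, batch_size=100):
    """Return `lfs hsm_restore` argument lists covering every file under `root`."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            files.append(os.path.join(dirpath, name))
    return [
        ["lfs", "hsm_restore", *files[i:i + batch_size]]
        for i in range(0, len(files), batch_size)
    ]


def preload(root):
    """Ask Lustre to restore the contents of all files under `root`."""
    import subprocess

    for command in build_restore_commands(root):
        subprocess.run(command, check=True)


# Example call on the instance where FSx is mounted:
# preload("/fsx_input/ExtData")
```

The restore requests are queued and processed by the file system, so the data may not be fully loaded at the moment preload returns.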

Behavior summary

Operation                         Result                           Notes
--------------------------------  -------------------------------  -------------------------------
FSx → S3 (existing files)         Not exported automatically       One-time export required
FSx → S3 (new or modified files)  Automatically exported           Full data copied asynchronously
FSx file deletion                 No effect on S3                  No automatic deletion
S3 → FSx (any file)               Imported on access               Metadata and data are lazy
Auto-import policy                Enables access-triggered import  No proactive copying

Note

  • DRA is particularly useful for large datasets

  • EC2-based aws s3 sync remains a flexible fallback option