Data transfer between FSx and S3 bucket
==========================================

This section describes two approaches for transferring data between an Amazon FSx for Lustre file system and an Amazon S3 bucket.

Launch an EC2 instance for data transfer
-----------------------------------------

In this approach, data transfer is performed through a dedicated EC2 instance. This instance is used **only for data movement**, not for computation.

- Launch EC2 instance

  - Launch a **single EC2 instance** (not a ParallelCluster)
  - The instance must be in the **same VPC** as the FSx file system
  - Mount the FSx file system on the instance

- Data transfer commands

  - Transfer data from S3 to FSx: ``aws s3 sync s3:///input /fsx/input``
  - Transfer data from FSx to S3: ``aws s3 sync /fsx/output s3:///output``

- Terminate instance

  - After data transfer is complete, terminate the EC2 instance to avoid unnecessary charges.

- Instance type recommendation: a compute-optimized instance type is recommended for data transfer tasks (e.g. ``c6i.large``).
- Python tools for data transfer: Python packages such as ``boto3`` can also be used to download data from an S3 bucket.

  - Refer to the `detailed tutorial `_ for downloading GEOS-Chem input data on AWS.
  - `Example scripts `_ for downloading data on AWS.

.. _dra:

Data repository association (DRA)
-----------------------------------------

FSx for Lustre **Data Repository Associations (DRA)** provide significantly higher performance than ``aws s3 sync`` for transferring data between FSx and S3. With DRA, data transfer is handled natively by the AWS service rather than through an EC2 instance.

Requirements
^^^^^^^^^^^^

- The FSx file system and the S3 bucket reside in the **same AWS account** (account ID, not IAM user)
- The FSx file system and the S3 bucket are in the **same region**
- The S3 bucket to be linked allows FSx access as specified in the :ref:`permissions `.
- If these requirements are not met, the DRA cannot be created or data cannot be loaded.

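
The region requirement can be checked before creating a DRA. The following is a minimal sketch, not a definitive procedure: the helper names ``normalize_bucket_region`` and ``same_region`` and the variables ``BUCKET`` and ``FSX_REGION`` are hypothetical. It relies on ``aws s3api get-bucket-location`` reporting buckets in ``us-east-1`` with an empty location constraint, which the CLI prints as ``None`` in text output.

.. code-block:: bash

    # Sketch only: helper names and variables are illustrative assumptions.

    # Map the raw LocationConstraint value to a region name.
    # Buckets in us-east-1 report a null constraint ("None" in text output).
    normalize_bucket_region() {
        local raw
        raw="$1"
        if [ -z "$raw" ] || [ "$raw" = "None" ] || [ "$raw" = "null" ]; then
            echo "us-east-1"
        else
            echo "$raw"
        fi
    }

    # Return 0 when the bucket region ($1, raw value) matches the FSx region ($2).
    same_region() {
        local bucket_region
        bucket_region="$(normalize_bucket_region "$1")"
        [ "$bucket_region" = "$2" ]
    }

    # Usage (requires AWS credentials; BUCKET and FSX_REGION are placeholders):
    #   raw="$(aws s3api get-bucket-location --bucket "$BUCKET" \
    #            --query LocationConstraint --output text)"
    #   same_region "$raw" "$FSX_REGION" && echo "regions match"
    #   aws sts get-caller-identity --query Account --output text  # account ID in use

The comparison is kept separate from the AWS calls so that it can be reused in scripts that already know the bucket location.
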
Create DRA association
^^^^^^^^^^^^^^^^^^^^^^^

Through console
~~~~~~~~~~~~~~~~~

Specify the associations on the **Data repository import/export (DRA)** tab when creating an FSx file system.

Assume an FSx for Lustre file system is mounted on an EC2 instance at::

    /fsx_input

Three data repository associations (DRAs) are created with the following settings:

- **DRA 1**

  - File system path: ``/ExtData``
  - S3 path: ``s3://dzhang-imi-gchp-test/ExtData``

- **DRA 2**

  - File system path: ``/blended-tropomi``
  - S3 path: ``s3://dzhang-imi-gchp-test/blended-tropomi``

- **DRA 3**

  - File system path: ``/blended-boundary-conditions``
  - S3 path: ``s3://dzhang-imi-gchp-test/blended-boundary-conditions``

.. note::

    - **Select all import policies and deselect all export policies** so that S3 → FSx synchronization is enabled while FSx → S3 synchronization is disabled.
    - The AWS console cannot create multiple DRAs at once. Additional DRAs can be added after the file system is created:

      - Go to AWS Console → FSx
      - Select your FSx for Lustre file system
      - Open the **Data repository** tab
      - Click **Create data repository association**

In this case, the linked S3 data will appear locally on the EC2 instance as::

    /fsx_input/ExtData/
    /fsx_input/blended-tropomi/
    /fsx_input/blended-boundary-conditions/

The local mount point (``/fsx_input``) corresponds to the root of the FSx file system. Each data repository association creates a directory directly under this root, with contents mirrored from the associated S3 prefix.

Through CLI
~~~~~~~~~~~~~

Data repository associations (DRA) can be added during creation or afterwards using ``aws fsx create-data-repository-association``. FSx needs an IAM role that it can assume to access S3 (trusted by ``fsx.amazonaws.com`` and allowing the required S3 actions). Note that the commands below do not pass a role ARN explicitly.

.. code-block:: bash

    # Create three Data Repository Associations (DRAs).
    # Note: DRA file system paths cannot overlap (e.g., /ExtData and /ExtData/subdir).
    # DRAs are supported on FSx for Lustre 2.12/2.15 file systems (excluding scratch_1).

    # DRA 1: Import-only (recommended for static input data like ExtData)
    aws fsx create-data-repository-association \
        --file-system-id "$FSX_ID" \
        --file-system-path "/ExtData" \
        --data-repository-path "s3://dzhang-imi-gchp-test/ExtData" \
        --batch-import-meta-data-on-create \
        --s3 '{ "AutoImportPolicy": {"Events": ["NEW","CHANGED","DELETED"]} }' \
        --tags Key=Name,Value=dra-extdata \
        --client-request-token dra-extdata-001

    # DRA 2: Import-only (or add AutoExportPolicy if you truly want FSx -> S3 sync)
    aws fsx create-data-repository-association \
        --file-system-id "$FSX_ID" \
        --file-system-path "/blended-tropomi" \
        --data-repository-path "s3://dzhang-imi-gchp-test/blended-tropomi" \
        --batch-import-meta-data-on-create \
        --s3 '{ "AutoImportPolicy": {"Events": ["NEW","CHANGED","DELETED"]} }' \
        --tags Key=Name,Value=dra-blended-tropomi \
        --client-request-token dra-tropomi-001

    # DRA 3: Import-only (or add AutoExportPolicy if you truly want FSx -> S3 sync)
    aws fsx create-data-repository-association \
        --file-system-id "$FSX_ID" \
        --file-system-path "/blended-boundary-conditions" \
        --data-repository-path "s3://dzhang-imi-gchp-test/blended-boundary-conditions" \
        --batch-import-meta-data-on-create \
        --s3 '{ "AutoImportPolicy": {"Events": ["NEW","CHANGED","DELETED"]} }' \
        --tags Key=Name,Value=dra-blended-bc \
        --client-request-token dra-bc-001

Verify that the DRAs exist:

.. code-block:: bash

    aws fsx describe-data-repository-associations \
        --filters Name=file-system-id,Values="$FSX_ID" \
        --query "Associations[*].{Path:FileSystemPath,S3:DataRepositoryPath,State:Lifecycle}"

.. important::

    FSx for Lustre accesses S3 through :ref:`an IAM service role ` trusted by ``fsx.amazonaws.com``. When enabling DRA, ensure that the associated IAM role has permission to ``s3:GetObject``, ``s3:PutObject``, and ``s3:ListBucket`` on the **linked S3 bucket or prefix**.

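
A newly created DRA is not usable until its lifecycle state becomes ``AVAILABLE``. The following is a minimal polling sketch built on the same ``describe-data-repository-associations`` call used for verification. The function name ``wait_for_dra``, its default retry count and delay, and the association ID in the usage line are hypothetical, not part of the original workflow.

.. code-block:: bash

    # Sketch only: poll one DRA until it leaves the CREATING state.
    # Arguments: association ID, max attempts (default 30), delay in seconds (default 20).
    # Returns 0 when AVAILABLE, 1 when FAILED or MISCONFIGURED, 2 on timeout.
    wait_for_dra() {
        local assoc_id max_tries delay state i
        assoc_id="$1"
        max_tries="${2:-30}"
        delay="${3:-20}"
        state=""
        i=0
        while [ "$i" -lt "$max_tries" ]; do
            state="$(aws fsx describe-data-repository-associations \
                --association-ids "$assoc_id" \
                --query "Associations[0].Lifecycle" \
                --output text)"
            case "$state" in
                AVAILABLE)
                    echo "$assoc_id is AVAILABLE"
                    return 0
                    ;;
                FAILED|MISCONFIGURED)
                    echo "$assoc_id is $state" >&2
                    return 1
                    ;;
            esac
            sleep "$delay"
            i=$((i + 1))
        done
        echo "timed out waiting for $assoc_id (last state: $state)" >&2
        return 2
    }

    # Usage (the association ID is a placeholder):
    #   wait_for_dra "dra-0123456789abcdef0" 30 20

A ``MISCONFIGURED`` state is treated as a failure here because it usually points to a permissions or bucket configuration problem that waiting will not resolve.
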
Import and Export Semantics of DRA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Data Repository Association (DRA) does **not** provide real-time or bidirectional synchronization between FSx and S3. Instead, it implements **directional, policy-driven, and largely lazy data movement**. Understanding these semantics is critical when using FSx scratch file systems.

FSx → S3 (Export Semantics)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **One-time export required for pre-existing data**

  Files that already exist in FSx **before** the DRA is created are **not** exported automatically. A one-time export task is required to establish a baseline copy in S3.

- **Auto-export (after DRA creation)**

  When auto-export is enabled:

  - Newly created or modified files in FSx are **automatically exported** to S3
  - The **full file contents** (not only metadata) are written to S3
  - Export occurs **asynchronously** but usually within minutes

- **Deletion behavior**

  - Deleting a file in FSx **does not delete** the corresponding object in S3 unless the auto-export policy includes the ``DELETED`` event
  - Without that event, S3 is treated as durable, append-oriented storage

S3 → FSx (Import Semantics)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Lazy import model**

  With auto-import enabled:

  - File contents stored in S3 (created **before or after** the DRA) are **not** proactively copied to FSx
  - File metadata enters the FSx namespace through the metadata import at DRA creation (``--batch-import-meta-data-on-create``) and through auto-import events, so ``ls`` and ``stat`` work without downloading data
  - File contents are downloaded **only when the file is read** (e.g., file open or model read)

- **Access-triggered behavior**

  On **first read** of a given file from FSx:

  - File data blocks are downloaded on demand
  - Subsequent accesses to the same file reuse the cached data and do not require re-downloading, unless the cache is evicted or the file changes

- **Manual import tasks (optional)**

  Manual import tasks may be used to pre-populate directory structure and metadata, but file contents are still fetched lazily unless a full import is explicitly requested.

Behavior summary
~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 30 35 35
   :header-rows: 1

   * - Operation
     - Result
     - Notes
   * - FSx → S3 (existing files)
     - Not exported automatically
     - One-time export required
   * - FSx → S3 (new or modified files)
     - Automatically exported
     - Full data copied asynchronously
   * - FSx file deletion
     - No effect on S3
     - Unless auto-export includes ``DELETED``
   * - S3 → FSx (any file)
     - Contents loaded on first read
     - Metadata arrives via metadata import or auto-import events
   * - Auto-import policy
     - Keeps FSx metadata in step with S3 events
     - No proactive copying of file contents

.. note::

   - DRA is particularly useful for large datasets
   - EC2-based ``aws s3 sync`` remains a flexible fallback option

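
The one-time export described under the export semantics can be started as a data repository task. The following is a minimal sketch using ``aws fsx create-data-repository-task``. The function name ``start_export_task`` and the file system ID in the usage line are hypothetical, and the completion report is disabled only to keep the example short. Export paths are given relative to the file system root, so a leading ``/`` is stripped.

.. code-block:: bash

    # Sketch only: start a one-time export of a directory from FSx to the linked S3 path.
    # Arguments: file system ID, path on the file system (e.g. /output or output).
    # Prints the ID of the created task.
    start_export_task() {
        local fsx_id path
        fsx_id="$1"
        path="${2#/}"   # export paths are relative to the file system root
        if [ -z "$fsx_id" ] || [ -z "$path" ]; then
            echo "usage: start_export_task <file-system-id> <path>" >&2
            return 1
        fi
        aws fsx create-data-repository-task \
            --file-system-id "$fsx_id" \
            --type EXPORT_TO_REPOSITORY \
            --paths "$path" \
            --report Enabled=false \
            --query "DataRepositoryTask.TaskId" \
            --output text
    }

    # Usage (the file system ID is a placeholder):
    #   start_export_task "fs-0123456789abcdef0" /output

The path must fall under a directory that is linked to S3 by a DRA. Progress can be followed with ``aws fsx describe-data-repository-tasks``.
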