.. _fsx-s3-data-management:

Loading Data from S3 to FSx and Using S3 as Backup
==================================================

This section describes how to load data from Amazon S3 into an FSx for Lustre
file system using Data Repository Associations (DRA), and presents a
recommended strategy for using S3 as durable storage to back up data when
using **FSx scratch** file systems.

Background and Design Philosophy
--------------------------------

FSx for Lustre **scratch** file systems provide very high I/O performance at
low cost, but they are **not durable storage**:

- Scratch FSx does **not** replicate data
- Scratch FSx does **not** support backups
- Data loss is possible in the event of service or hardware failure

Therefore:

- **FSx scratch must never be treated as the only copy of important data**
- **Amazon S3 should be used as the authoritative source and/or backup**

The recommended design is:

- **Input data**: stored durably in S3, imported into FSx for fast access
- **Output data**: written to FSx during computation and exported back to S3

Data Repository Associations (DRA)
----------------------------------

A Data Repository Association (DRA) links a directory inside FSx to an S3
bucket or prefix.

For example:

- FSx path ``/ExtData`` ↔ ``s3://my-bucket/ExtData/``
- FSx path ``/output`` ↔ ``s3://my-bucket/output/``

After mounting FSx locally at ``/fsx_input``, these appear as::

    /fsx_input/ExtData
    /fsx_input/output

A file system may have **multiple DRAs**, provided their FSx paths do not
overlap.

How Data Is Loaded from S3 to FSx
---------------------------------

It is important to understand that **creating a DRA does not immediately copy
all file contents from S3**.

By default:

- FSx does **not** proactively import metadata or file contents
- Metadata and data are imported **lazily** when files are accessed
- This behavior saves time and storage for large datasets

There are three ways to populate data from S3 to FSx.

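
Before choosing a method, it can help to sanity-check a planned DRA layout
against the non-overlap rule from the previous section. The following is a
minimal sketch, not an AWS tool: the function names ``normalize_path``,
``paths_overlap``, and ``check_dra_paths`` are hypothetical, and the sketch
assumes that two FSx paths overlap when they are equal or when one is a parent
directory of the other.

.. code-block:: bash

    #!/usr/bin/env bash
    # Sketch (hypothetical helpers): check that DRA file-system paths do not overlap.
    # Assumption: paths overlap if they are equal or one is a parent of the other.

    # Strip trailing slashes so "/ExtData/" and "/ExtData" compare equal.
    normalize_path() {
        local p="$1"
        while [ "${#p}" -gt 1 ] && [ "${p%/}" != "$p" ]; do
            p="${p%/}"
        done
        printf '%s\n' "$p"
    }

    # Return 0 if the two paths overlap, 1 otherwise.
    paths_overlap() {
        local a b
        a="$(normalize_path "$1")"
        b="$(normalize_path "$2")"
        [ "$a" = "$b" ] && return 0
        # The root path contains every other path.
        [ "$a" = "/" ] || [ "$b" = "/" ] && return 0
        case "$a/" in "$b"/*) return 0 ;; esac
        case "$b/" in "$a"/*) return 0 ;; esac
        return 1
    }

    # Compare every pair of paths; print each overlap and return 1 if any is found.
    check_dra_paths() {
        local -a paths=("$@")
        local i j status=0
        for ((i = 0; i < ${#paths[@]}; i++)); do
            for ((j = i + 1; j < ${#paths[@]}; j++)); do
                if paths_overlap "${paths[i]}" "${paths[j]}"; then
                    echo "overlap: ${paths[i]} <-> ${paths[j]}"
                    status=1
                fi
            done
        done
        return "$status"
    }

With the example layout above, ``check_dra_paths /ExtData /output`` prints
nothing and returns success, while adding ``/ExtData/sub`` to the list would be
reported as overlapping ``/ExtData``.
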
Method 1: On-Demand Loading (Default)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If auto-import policies are enabled on the DRA:

- Files and directories appear immediately under FSx
- File contents are fetched automatically when first accessed

Example::

    ls /fsx_input/ExtData
    cat /fsx_input/ExtData/example.nc

This method is sufficient for most workflows and requires no manual action.

Method 2: Batch Metadata Import (Recommended After Creation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To ensure the full directory structure appears immediately, run a metadata
import task. Auto-import policies are enabled by default; you can confirm this
in either of the following ways.

- Checking Auto-Import Policies via AWS Console

  1. Go to **AWS Console → FSx**
  2. Select your **FSx for Lustre** file system
  3. Open the **Data repository** tab
  4. In the **Data repository associations** table, check the
     **Import policy** column for each DRA

  Possible values include:

  - ``NEW, CHANGED, DELETED``

    Auto-import is **enabled**. FSx will automatically detect new, modified,
    and deleted objects in the associated S3 path.

  - ``None`` or an empty value

    Auto-import is **not enabled**. FSx will not automatically notice changes
    in S3.

- Checking Auto-Import Policies via AWS CLI

  You can also verify auto-import policies using the AWS CLI:

  .. code-block:: bash

      aws fsx describe-data-repository-associations

  For a more readable summary:

  .. code-block:: bash

      aws fsx describe-data-repository-associations \
        --query "Associations[*].{ \
          FSxPath:FileSystemPath, \
          ImportPolicy:S3.AutoImportPolicy.Events, \
          ExportPolicy:S3.AutoExportPolicy.Events \
        }"

  Example output::

      [
        {
          "FSxPath": "/ExtData",
          "ImportPolicy": ["NEW", "CHANGED", "DELETED"],
          "ExportPolicy": null
        },
        {
          "FSxPath": "/blended-tropomi",
          "ImportPolicy": ["NEW", "CHANGED", "DELETED"],
          "ExportPolicy": ["NEW", "CHANGED", "DELETED"]
        }
      ]

  Interpretation:

  - ``ImportPolicy`` containing one or more events (``NEW``, ``CHANGED``,
    ``DELETED``) indicates that **auto-import is enabled**
  - ``ImportPolicy = null`` indicates that **auto-import is disabled**
  - ``ExportPolicy`` shows whether automatic export (FSx → S3) is enabled

- What Auto-Import Policies Do (and Do Not Do)

  Auto-import policies provide the following behavior:

  - Changes in S3 are automatically detected
  - Directory structure and metadata in FSx are kept in sync
  - File contents are fetched **on demand** when accessed

  Auto-import policies **do not** guarantee that:

  - All file contents have already been copied to FSx
  - FSx remains usable if the S3 bucket is deleted

  To fully copy all file contents from S3 to FSx, an explicit preload is
  required (see Method 3).

To import the metadata for a DRA, create an
``IMPORT_METADATA_FROM_REPOSITORY`` task. For import tasks, ``--paths`` takes
S3 paths in the form ``s3://bucket-name/prefix`` rather than FSx paths, and
``--report`` is a required option. Example::

    aws fsx create-data-repository-task \
        --type IMPORT_METADATA_FROM_REPOSITORY \
        --file-system-id fs-xxxxxxxx \
        --paths s3://my-bucket/ExtData/ \
        --report Enabled=false

Repeat for other DRA paths as needed. This imports **only metadata**, not file
contents.

Method 3: Full Data Import (Required Before Deleting S3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you intend to delete the associated S3 bucket or make FSx fully
self-contained, you **must explicitly load all file contents**.

``create-data-repository-task`` has no task type that copies file contents
from S3; its types are ``EXPORT_TO_REPOSITORY``,
``IMPORT_METADATA_FROM_REPOSITORY``, ``RELEASE_DATA_FROM_FILESYSTEM``, and
``AUTO_RELEASE_DATA``. File contents are instead preloaded from a client that
has the file system mounted, using ``lfs hsm_restore``. Example::

    nohup find /fsx_input/ExtData -type f -print0 | \
        xargs -0 -n 1 -P 8 sudo lfs hsm_restore &

This copies **all file contents** under the given directory from S3 to FSx.
Repeat for other DRA paths as needed.

.. warning::

   Do **not** delete the S3 bucket until every file has been restored. DRA
   availability alone does not guarantee data has been copied.

   ``lfs hsm_restore`` returns before the copy has finished. Check files with
   ``lfs hsm_state`` and proceed only when none of them is reported as
   ``released``.

   Metadata import tasks can be monitored with::

       aws fsx describe-data-repository-tasks

   A task has finished successfully when ``Lifecycle`` is ``SUCCEEDED``.

FSx Storage Capacity Considerations
-----------------------------------

When performing a full import:

- FSx storage capacity must be **greater than or equal to** the total logical
  size of data in S3 (adding some headroom is recommended)
- Do **not** rely on compression to reduce required capacity
- Imports will fail if FSx runs out of space

Recommended Backup Strategy Using S3
------------------------------------

The recommended and safe pattern for FSx scratch usage is:

Input Data (Read-Only)
^^^^^^^^^^^^^^^^^^^^^^

- **Authoritative copy**: S3
- **Working copy**: FSx scratch
- **DRA policy**:

  - Import: NEW, CHANGED, DELETED
  - Export: disabled

If FSx fails:

- Recreate FSx
- Re-import from S3
- No data loss

Output Data (Write-Back)
^^^^^^^^^^^^^^^^^^^^^^^^

- **Working location**: FSx scratch
- **Backup location**: S3 (separate bucket or prefix)
- **DRA policy**:

  - Export enabled (auto-export or manual)

- Output data is continuously or periodically written to S3

If FSx fails:

- Completed output already exists in S3
- Only in-progress work may be lost

.. note::

   S3 provides high durability at low cost and should always be used to store
   important or irreplaceable data when using FSx scratch.

When Is It Safe to Delete the S3 Bucket?
----------------------------------------

It is safe to delete the S3 bucket **only if all of the following are true**:

- All file contents have been preloaded with ``lfs hsm_restore`` and no file
  is still in the ``released`` state
- FSx storage usage is stable (verified with ``lfs df``)
- Files remain readable after cache drops
- You understand that FSx scratch provides no recovery or backups

In most cases, deleting the S3 bucket is **not recommended**. Keeping S3 as
the durable source of truth is safer and usually cheaper.

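
One way to check whether file contents are actually present on FSx is the
Lustre client command ``lfs hsm_state``, which reports a file as ``released``
while its contents exist only in S3. The following is a minimal sketch under
stated assumptions: ``count_released`` is a hypothetical helper, and it assumes
each output line has the form
``<path>: (0x...) released exists archived, archive_id:1``.

.. code-block:: bash

    #!/usr/bin/env bash
    # Sketch (hypothetical helper): count files whose contents are still only in S3.
    # Reads `lfs hsm_state` output on stdin, one file per line. Assumed formats:
    #   /fsx_input/ExtData/a.nc: (0x0000000d) released exists archived, archive_id:1
    #   /fsx_input/ExtData/b.nc: (0x00000009) exists archived, archive_id:1

    count_released() {
        local line count=0
        while IFS= read -r line; do
            # Match the "released" flag directly after the hexadecimal flag mask.
            case "$line" in
                *") released"*) count=$((count + 1)) ;;
            esac
        done
        echo "$count"
    }

    # Example usage on a client with the file system mounted at /fsx_input:
    #   find /fsx_input/ExtData -type f -print0 | xargs -0 lfs hsm_state | count_released

A result of ``0`` suggests that every file under the directory has been loaded.
A non-zero result means some files would become unreadable if the S3 bucket
were deleted.
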
Summary
-------

- FSx scratch provides fast but non-durable storage
- A DRA does not automatically copy all file contents
- Explicitly load file contents to fully populate FSx if needed
- Always use S3 as the durable source and/or backup
- Never treat FSx scratch as the only copy of important data