Upload Data to an Amazon S3 Bucket

This tutorial describes common and recommended methods for uploading data to an existing Amazon S3 bucket.

Typical use cases include:

Uploading model input data

Storing simulation outputs

Backing up data from EC2 or FSx for Lustre

Sharing data across AWS services

This guide assumes that the S3 bucket already exists.

Prerequisites

Before uploading data to S3, ensure that:

You have access to an existing S3 bucket

Your IAM user or role has permission to write to the bucket

AWS CLI is installed and configured if using command-line methods

Upload Input Data Using Python Scripts (Recommended)

Copying data between two S3 buckets in the same region incurs no data transfer fee, only a small S3 request fee.
The copied data incurs S3 storage fees, but these are much lower than FSx for Lustre and compute costs; see pricing.
We first transfer data to an S3 bucket in the same AWS account, and then use a Data Repository Association (DRA) to load it from S3 into FSx, which is much faster than aws s3 sync.
All input data is available in S3 buckets, so it can be copied directly from bucket to bucket.
Source buckets for input data:

Input data	S3 bucket name (region)	Download script
GEOS-Chem	geos-chem (us-west-2); gcgrid (us-east-1)	`copy_geoschem_inputdata_s3_to_s3.py`
Blended TROPOMI	blended-tropomi-gosat-methane (us-west-2)	`copy_blended_TROPOMI_s3_to_s3.py`
IMI boundary conditions	imi-boundary-conditions (us-east-1)	`copy_imi_boundary_conditions_s3_to_s3.py`

Check the region of a bucket with:

aws s3api head-bucket --bucket <bucket_name>

Note

The helper scripts must be run with valid AWS credentials, either IAM user credentials or the credentials assumed by an EC2 instance role.
The region in parentheses after each bucket name is not part of the name. For example, the bucket name is gcgrid, not gcgrid (us-east-1).
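
If the table entries are reused in a script, the region annotation has to be stripped before the name is passed to any command. The following is a minimal sketch of that step; split_bucket_entry is a hypothetical helper, not part of the provided scripts.

```python
import re
from typing import Optional, Tuple

# Matches "name" or "name (region)", e.g. "gcgrid (us-east-1)".
_ENTRY = re.compile(
    r"^\s*(?P<bucket>[a-z0-9][a-z0-9.-]*)\s*(?:\((?P<region>[a-z0-9-]+)\))?\s*$"
)


def split_bucket_entry(entry: str) -> Tuple[str, Optional[str]]:
    """Split a table entry into (bucket_name, region).

    Hypothetical helper: region is None when no annotation is present.
    """
    match = _ENTRY.match(entry)
    if match is None:
        raise ValueError(f"not a bucket entry: {entry!r}")
    return match.group("bucket"), match.group("region")


bucket, region = split_bucket_entry("gcgrid (us-east-1)")
print(bucket, region)
```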

Download script tutorials

For GEOS-Chem input data

Generate a dry-run log using the GCClassic dry-run option (see the instructions here).

Use the helper script copy_geoschem_inputdata_s3_to_s3.py:

python copy_geoschem_inputdata_s3_to_s3.py \
  <dryrun_log> \
  --dest-bucket <bucket> \
  [--dest-prefix <prefix>] \
  [--dryrun]

An example dry run:

python copy_geoschem_inputdata_s3_to_s3.py \
  geoschem.dryrun.log \
  --dest-bucket dzhang-imi-gchp-test \
  --dest-prefix ExtData \
  --dryrun

You will see output similar to:

=============Mirror ExtData referenced by dry-run log=============
Log:        geoschem.dryrun.log
Source:     s3://gcgrid/
Dest:       s3://dzhang-imi-gchp-test/ExtData/
Mode:       found+missing
Objects:    77
Dryrun:     True | Overwrite: False
===================================================================
[DRYRUN] s3://gcgrid/GEOS_0.25x0.3125/GEOS_FP/2011/01/GEOSFP.20110101.CN.025x03125.nc
-> s3://dzhang-imi-gchp-test/ExtData/GEOS_0.25x0.3125/GEOS_FP/2011/01/
GEOSFP.20110101.CN.025x03125.nc
...
===================================================================
Done. copied=0 skipped=0 failed=0 dryrun=77
===================================================================

If you are satisfied with the dry-run output, remove --dryrun to perform the actual copy.
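
When the copy is run from a batch script, the final summary line can be checked automatically. The sketch below assumes the summary keeps the "Done. copied=... skipped=... failed=... dryrun=..." format shown in the example output; parse_summary and copy_succeeded are hypothetical helpers, not part of the provided scripts.

```python
import re
from typing import Dict


def parse_summary(line: str) -> Dict[str, int]:
    """Parse a summary line such as 'Done. copied=0 skipped=0 failed=0 dryrun=77'.

    Assumes the key=value layout shown in the example output above.
    """
    line = line.strip()
    if not line.startswith("Done."):
        raise ValueError(f"not a summary line: {line!r}")
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def copy_succeeded(line: str) -> bool:
    """Return True when the summary reports no failed objects."""
    return parse_summary(line).get("failed", 0) == 0


summary = parse_summary("Done. copied=0 skipped=0 failed=0 dryrun=77")
print(summary)
```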

For satellite data

Use the helper script copy_blended_TROPOMI_s3_to_s3.py:

python copy_blended_TROPOMI_s3_to_s3.py \
  <start-time> <end-time> \
  <dest-bucket> \
  [--dest-prefix PREFIX] \
  [--src-bucket NAME] \
  [--dryrun]

An example dry run:

python copy_blended_TROPOMI_s3_to_s3.py \
  20240501 20240502 dzhang-imi-gchp-test \
  --dest-prefix blended-tropomi/ --dryrun

You will see output similar to:

=============Copying S3 -> S3 (date-filtered)=============
Range: [20240501, 20240502)
Source: s3://blended-tropomi-gosat-methane/
Dest:   s3://dzhang-imi-gchp-test/blended-tropomi/
Files matched: 15
[DRYRUN] copy s3://blended-tropomi-gosat-methane/data/2024-05/
S5P_BLND_L2__CH4____20240501T115126_20240501T133256_33938_03_020600_20240918T125812.nc
-> s3://dzhang-imi-gchp-test/blended-tropomi/
S5P_BLND_L2__CH4____20240501T115126_20240501T133256_33938_03_020600_20240918T125812.nc
...
Done. Success: 15, Failed: 0

If you are satisfied with the dry-run output, remove --dryrun to perform the actual copy.
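
The range printed in the output, [20240501, 20240502), is half-open: the start date is included and the end date is excluded. The following sketch illustrates that kind of filter on file names. It assumes the first timestamp in the file name is the one compared against the range, which the source does not confirm; sensing_date and filter_by_date are hypothetical helpers, not the script's actual implementation.

```python
import re
from datetime import date, datetime
from typing import Iterable, List, Optional

# First "_YYYYMMDDThhmmss_" stamp in a name such as
# S5P_BLND_L2__CH4____20240501T115126_20240501T133256_...nc
_STAMP = re.compile(r"_(\d{8})T\d{6}_")


def sensing_date(filename: str) -> Optional[date]:
    """Return the date of the first timestamp in the name, or None."""
    match = _STAMP.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def filter_by_date(names: Iterable[str], start: date, end: date) -> List[str]:
    """Keep names whose date d satisfies start <= d < end (half-open range)."""
    kept = []
    for name in names:
        d = sensing_date(name)
        if d is not None and start <= d < end:
            kept.append(name)
    return kept
```

With start 20240501 and end 20240502, only files stamped 20240501 are kept; to include files from 20240502, the end date must be 20240503.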

For boundary conditions from satellite data

Use the helper script copy_imi_boundary_conditions_s3_to_s3.py:

python copy_imi_boundary_conditions_s3_to_s3.py \
  <start_yyyymmdd> <end_yyyymmdd> \
  <dest-bucket> <vYYYY-MM> \
  [--dest-prefix PREFIX] \
  [--src-bucket imi-boundary-conditions] \
  [--src-prefix PREFIX] \
  [--dryrun]

An example dry run:

python copy_imi_boundary_conditions_s3_to_s3.py \
  20240501 20240502 \
  dzhang-imi-gchp-test \
  v2025-06-blended \
  --dest-prefix blended-boundary-conditions/ --dryrun

You will see output similar to:

=============Mirror IMI Boundary Conditions (S3 → S3)=============
Range:      [20240501, 20240502)
Source:     s3://imi-boundary-conditions/v2025-06-blended/
Dest:       s3://dzhang-imi-gchp-test/blended-boundary-conditions/
Key rule:   dst_key = <dest_prefix> + <src_key>
Dryrun:     True | Overwrite: False
===================================================================
Objects matched: 1
[DRYRUN] s3://imi-boundary-conditions/v2025-06-blended/GEOSChem.BoundaryConditions.20240501_0000z.nc4
-> s3://dzhang-imi-gchp-test/blended-boundary-conditions/v2025-06-blended/
GEOSChem.BoundaryConditions.20240501_0000z.nc4
===================================================================
Done. copied=0 skipped=0 failed=0 dryrun=1
===================================================================

If you are satisfied with the dry-run output, remove --dryrun to perform the actual copy.
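
The key rule printed in the output, dst_key = <dest_prefix> + <src_key>, means the version folder of the source key is kept under the destination prefix. A minimal sketch of that rule follows; the slash normalization is an assumption made for illustration, and dest_key is a hypothetical helper rather than the script's own function.

```python
def dest_key(dest_prefix: str, src_key: str) -> str:
    """Build the destination key as <dest_prefix> + <src_key>.

    Assumption: exactly one "/" joins the two parts, and an empty
    prefix leaves the source key unchanged.
    """
    prefix = dest_prefix.strip("/")
    key = src_key.lstrip("/")
    return f"{prefix}/{key}" if prefix else key


print(dest_key(
    "blended-boundary-conditions/",
    "v2025-06-blended/GEOSChem.BoundaryConditions.20240501_0000z.nc4",
))
```

This reproduces the destination shown in the example dry run, where the object lands under blended-boundary-conditions/v2025-06-blended/.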

Upload Output Data from FSx for Lustre (Best for Large Outputs)

When using FSx for Lustre with an S3 Data Repository Association (DRA), data written to FSx can be automatically synchronized to S3.

Typical workflow:

/fsx/
└── output/
    └── Test_Global_1day_c36s10/

With DRA configured:

Files written to FSx appear in the associated S3 bucket

No explicit aws s3 cp command is required

This method is recommended for:

Large-scale model output

ParallelCluster workflows

Repeated data transfers

Note

FSx and S3 must be in the same AWS account and region for DRA to work.
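
To see where an output file is expected to appear in S3, the path mapping can be sketched as below. This assumes the DRA links a directory on the mounted file system (here /fsx/output) to an S3 prefix, and that paths below that directory are appended to the prefix unchanged. The bucket name my-bucket, the file name run.log, and the helper s3_uri_for are hypothetical.

```python
from pathlib import PurePosixPath


def s3_uri_for(fsx_file: str, dra_fs_path: str, dra_s3_path: str) -> str:
    """Map a file under the DRA directory to its expected S3 URI.

    fsx_file    -- absolute path on the mounted file system
    dra_fs_path -- mounted directory linked by the DRA (assumed)
    dra_s3_path -- S3 URI linked by the DRA (assumed)
    Raises ValueError if the file is not under the DRA directory.
    """
    relative = PurePosixPath(fsx_file).relative_to(dra_fs_path)
    return dra_s3_path.rstrip("/") + "/" + relative.as_posix()


print(s3_uri_for(
    "/fsx/output/Test_Global_1day_c36s10/run.log",
    "/fsx/output",
    "s3://my-bucket/output/",
))
```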

Other Official Methods

Upload Data Using AWS CLI

The AWS CLI is the preferred method for uploading data in most research and HPC workflows. It is scriptable, restartable, and suitable for large datasets.

Upload a Single File

Use aws s3 cp to upload an individual file:

aws s3 cp local_file.nc s3://my-bucket/path/local_file.nc

Example:

aws s3 cp emis_2020.nc s3://imi-gchp-test/emissions/emis_2020.nc

Upload a Directory Recursively

To upload an entire directory and preserve its structure:

aws s3 cp local_directory/ s3://my-bucket/path/ --recursive

Example:

aws s3 cp ExtData/ s3://acmg-input-data/ExtData/ --recursive

This method is suitable for initial uploads of structured datasets.

Synchronize a Directory (Recommended)

Use aws s3 sync to upload only new or modified files:

aws s3 sync local_directory/ s3://my-bucket/path/

Example:

aws s3 sync output/ s3://acmg-gchp-output/Test_Global_1day_c360/

Advantages of sync:

Skips unchanged files

Safe for repeated execution

Ideal for incremental model outputs
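
The "skips unchanged files" behavior can be pictured as a per-file decision. The sketch below follows the default rule described in the AWS CLI documentation for sync (upload when the object is missing, the sizes differ, or the local file is newer); it is an illustration of that rule, not the CLI's actual code, and FileInfo and needs_upload are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int      # size in bytes
    mtime: float   # last-modified time, seconds since the epoch


def needs_upload(local: FileInfo, remote: Optional[FileInfo]) -> bool:
    """Decide whether a local file would be uploaded by a sync.

    remote is None when no object exists under the target key.
    """
    if remote is None:
        return True                      # object missing in S3
    if local.size != remote.size:
        return True                      # content length changed
    return local.mtime > remote.mtime    # local copy is newer
```

Because the comparison uses size and modification time rather than content, a file rewritten with identical size and an older timestamp would be skipped.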

Upload Data Using AWS Management Console

This method is suitable only for small files or one-off uploads.

Steps:

Open the S3 service in the AWS Management Console

Select the target bucket

Click Upload

Drag and drop files or select them manually

Click Upload

Limitations:

Not suitable for large datasets

No automation

Browser-dependent

Common Permission Requirements

Uploading data to S3 typically requires the following IAM permissions:

s3:PutObject
s3:PutObjectAcl
s3:ListBucket

These permissions must apply to:

The bucket itself (for s3:ListBucket)

All objects within the bucket (for s3:PutObject and s3:PutObjectAcl)

If uploading from EC2 or ParallelCluster:

The instance IAM role must have these permissions

User permissions on your local machine do not apply
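
The split between bucket-level and object-level permissions can be made concrete with a policy document. The sketch below builds one as a Python dictionary and prints it as JSON. The bucket name my-bucket and the Sid values are placeholders, and the policy is a starting point under these assumptions, not a reviewed security configuration.

```python
import json


def upload_policy(bucket: str) -> dict:
    """Build an IAM policy document granting the upload permissions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: the resource is the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions: the resource is every key in the bucket.
                "Sid": "WriteObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:PutObjectAcl"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


print(json.dumps(upload_policy("my-bucket"), indent=2))
```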

Verification

To verify uploaded objects:

aws s3 ls s3://my-bucket/path/

To recursively list contents:

aws s3 ls s3://my-bucket/path/ --recursive

Note

Although S3 paths resemble Linux directories, S3 is not a filesystem: there are no real directories, only object keys that share a prefix. Operations such as mv or rename copy each object to a new key and delete the original rather than modifying directory metadata.
Always include the trailing / when operating on a “folder”. The trailing / tells the AWS CLI to treat the path as a prefix rather than a single object. Without it, a command such as aws s3 cp or aws s3 rm targets the object with that exact key, and aws s3 ls matches every key that begins with that string (for example, ExtData would also match a key under ExtData-old/).
When applying an operation to all objects under a prefix, also include --recursive; otherwise, the command will not descend into the pseudo-directory.
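
Prefix matching in S3 is a plain string comparison on object keys, which is why the trailing / matters. The sketch below shows the difference on a few hypothetical keys; keys_under is an illustrative helper and makes no calls to S3.

```python
from typing import Iterable, List


def keys_under(keys: Iterable[str], prefix: str) -> List[str]:
    """Return the keys that start with the given prefix (string match only)."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys in one bucket.
keys = [
    "ExtData/HEMCO/emis.nc",
    "ExtData-old/HEMCO/emis.nc",
    "ExtData.tar",
]

print(keys_under(keys, "ExtData/"))  # only keys inside the "folder"
print(keys_under(keys, "ExtData"))   # every key that begins with "ExtData"
```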

Checking S3 Bucket Size (for FSx Planning)

Before importing data from S3 into FSx, determine the total logical size of the S3 bucket or prefix. This value should be used to size FSx storage.

Use the AWS CLI (recommended):

aws s3 ls s3://dzhang-imi-gchp-test --recursive --summarize \
  | awk '/Total Size/ {printf "%.2f GB\n", $3/1024/1024/1024}'

To check the size of a specific prefix (useful when multiple DRAs are used), include the trailing / so that only keys under that prefix are counted:

aws s3 ls s3://dzhang-imi-gchp-test/ExtData/ --recursive --summarize \
  | awk '/Total Size/ {printf "%.2f GB\n", $3/1024/1024/1024}'
aws s3 ls s3://dzhang-imi-gchp-test/blended-tropomi/ --recursive --summarize \
  | awk '/Total Size/ {printf "%.2f GB\n", $3/1024/1024/1024}'
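
The same calculation can be done in Python when the size feeds into a planning script. The sketch below parses the "Total Size:" line that --summarize prints and converts bytes to GiB (1024^3 bytes, the same division as the awk commands). total_size_gib and prefix_size_gib are hypothetical helpers; the second one shells out to the AWS CLI and therefore needs the CLI installed and configured.

```python
import re
import subprocess


def total_size_gib(ls_output: str) -> float:
    """Extract 'Total Size: <bytes>' from `aws s3 ls --summarize` output."""
    match = re.search(r"Total Size:\s*(\d+)", ls_output)
    if match is None:
        raise ValueError("no 'Total Size' line found in the output")
    return int(match.group(1)) / 1024 ** 3


def prefix_size_gib(s3_uri: str) -> float:
    """Run the AWS CLI for a bucket or prefix URI and return its size in GiB."""
    result = subprocess.run(
        ["aws", "s3", "ls", s3_uri, "--recursive", "--summarize"],
        capture_output=True,
        text=True,
        check=True,
    )
    return total_size_gib(result.stdout)


# Made-up sample of the two summary lines, for illustration only.
sample = "Total Objects: 77\n   Total Size: 3221225472\n"
print(f"{total_size_gib(sample):.2f} GiB")
```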

Use the AWS console:

Go to your S3 bucket
Open the Metrics tab
Check Total bucket size

Notes and Best Practices

Prefer aws s3 sync for repeated uploads
Keep S3 buckets private by default
Upload data from the same AWS region whenever possible
Use FSx DRA for high-throughput workflows
Avoid browser uploads for large or critical datasets

Next Steps

After uploading data, you may want to:

Configure bucket access policies for collaborators

Associate the bucket with FSx for Lustre

Automate uploads in batch or workflow scripts