Automating MongoDB Backups with GitHub Actions

Database backups are like insurance policies—you hope you'll never need them, but when disaster strikes, you'll be grateful you invested the time to set them up properly. After our team experienced a near-catastrophic data loss event, we decided to implement a robust, automated MongoDB backup solution using GitHub Actions. The journey involved navigating permission issues, safely managing credentials, and ensuring backups could be reliably restored when needed.

The Incident That Started It All

It was a normal Tuesday afternoon when our MongoDB instance suddenly became unresponsive. After investigation, we discovered that a well-intentioned but ill-advised schema migration script had run in production without proper testing, corrupting critical collections. Our most recent backup was four days old, meaning we potentially faced substantial data loss.

We managed to recover most of the data through a combination of database journal files and application logs, but the experience highlighted a critical gap in our infrastructure: we lacked a reliable, automated backup system.

Initial Approach: A Simple Cron Job

Our first thought was to set up a simple cron job on a dedicated server:

# /etc/cron.d/mongodb-backup
0 2 * * * ec2-user /usr/local/bin/mongodb-backup.sh > /var/log/mongodb-backup.log 2>&1

With a basic shell script:

#!/bin/bash
# mongodb-backup.sh
DATE=$(date +%Y-%m-%d)
MONGO_URI="mongodb://username:password@localhost:27017/mydb"

# Create backup
mongodump --uri="$MONGO_URI" --out="/backups/$DATE"

# Compress backup
tar -czf "/backups/$DATE.tar.gz" -C "/backups" "$DATE"

# Remove uncompressed directory
rm -rf "/backups/$DATE"

# Upload to S3
aws s3 cp "/backups/$DATE.tar.gz" "s3://my-backups/mongodb/$DATE.tar.gz"

# Delete local backups older than 7 days
find /backups -name "*.tar.gz" -type f -mtime +7 -delete

While this approach worked, it had several significant drawbacks:

  1. Credentials were stored in plaintext in the script
  2. The backup process lacked monitoring and alerting
  3. We had no easy way to test restore procedures
  4. The backup server became a single point of failure
  5. Managing the backup infrastructure was another operational burden

Moving to GitHub Actions

Since our team was already using GitHub Actions for CI/CD, we decided to leverage it for our backup strategy as well. This would give us several advantages:

  • No dedicated backup server to maintain
  • Built-in scheduling with the cron trigger
  • Secret management for database credentials
  • Detailed logs and notifications for failures
  • Version-controlled backup configuration

Our initial GitHub Actions workflow looked like this:

# .github/workflows/mongodb-backup.yml
name: MongoDB Backup

on:
  schedule:
    - cron: '0 2 * * *'  # Run at 2 AM UTC daily
  workflow_dispatch:  # Allow manual trigger

jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - name: Install MongoDB tools
        run: |
          wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -
          echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
          sudo apt-get update
          sudo apt-get install -y mongodb-database-tools
      
      - name: Run MongoDB backup
        env:
          MONGO_URI: ${{ secrets.MONGO_URI }}
        run: |
          DATE=$(date +%Y-%m-%d)
          mongodump --uri="$MONGO_URI" --out="./backup/$DATE"
          tar -czf "$DATE.tar.gz" -C "./backup" "$DATE"
      
      - name: Upload to S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
        run: |
          DATE=$(date +%Y-%m-%d)
          aws s3 cp "$DATE.tar.gz" "s3://my-backups/mongodb/$DATE.tar.gz"
      
      - name: Clean up
        run: |
          DATE=$(date +%Y-%m-%d)
          rm -rf "./backup/$DATE" "$DATE.tar.gz"

We set up the necessary GitHub Secrets for MONGO_URI, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY, and then waited for the workflow to run.
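
GitHub Secrets can be added through the repository's Settings page or scripted with the gh CLI, which is handy if the same values need to go into several repositories. A quick sketch (the values shown here are placeholders, not our real credentials):

# Store the connection string and AWS credentials as repository secrets;
# omitting --body makes gh prompt for the value so it never lands in shell history
gh secret set MONGO_URI --body "mongodb://username:password@db.example.com:27017/mydb"
gh secret set AWS_ACCESS_KEY_ID      # prompts for the value interactively
gh secret set AWS_SECRET_ACCESS_KEY  # prompts for the value interactively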

Problem #1: Scheduled Actions Not Running

To our surprise, our scheduled backup didn't run the next day. After investigating, we realized that GitHub Actions schedules are only triggered on the default branch. Our workflow was on a feature branch that hadn't been merged yet! Once we merged to the main branch, the scheduled workflow began running.
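
Once the workflow was on main, the workflow_dispatch trigger also gave us a convenient way to smoke-test changes without waiting for 2 AM. For example, from the gh CLI:

# Trigger the backup workflow on demand instead of waiting for the schedule
gh workflow run mongodb-backup.yml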

Problem #2: MongoDB Authentication Errors

Our first successful run failed with an authentication error:

Error: error connecting to db server: server returned error on SASL authentication step: Authentication failed.

After double-checking our credentials, we discovered two issues:

  1. Our MongoDB password contained special characters (an @) that had to be percent-encoded in the connection URI
  2. We needed to specify the authentication database explicitly

We updated our MongoDB URI in the GitHub Secrets to fix these issues:

# Original (problematic)
mongodb://username:p@ssw0rd@db.example.com:27017/mydb

# Updated (working)
mongodb://username:p%40ssw0rd@db.example.com:27017/mydb?authSource=admin
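
If you are ever unsure how to encode a password, a one-liner like the following (a convenience, not part of the workflow itself) prints the percent-encoded form:

# Percent-encode a password for use inside a MongoDB connection URI
python3 -c "from urllib.parse import quote; print(quote('p@ssw0rd', safe=''))"
# -> p%40ssw0rd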

Problem #3: Network Access Restrictions

Our next attempt failed because our MongoDB server had IP restrictions, and the GitHub Actions runner's IP wasn't in the allowed list. We had two options:

  1. Add GitHub Actions IP ranges to our MongoDB allowed list (not ideal as they can change)
  2. Set up a self-hosted runner inside our VPC with access to the MongoDB server

For security reasons, we chose the second option. We updated our workflow to use a self-hosted runner:

# .github/workflows/mongodb-backup.yml
name: MongoDB Backup

on:
  schedule:
    - cron: '0 2 * * *'
  workflow_dispatch:

jobs:
  backup:
    runs-on: self-hosted  # Changed to self-hosted runner
    steps:
      # Rest of the workflow remains the same
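
Registering the runner itself is a one-time setup on a small instance inside the VPC. A rough sketch (the download URL and registration token come from the repository's Settings > Actions > Runners > "New self-hosted runner" page):

# Download and unpack the runner using the URL shown on the runner setup page
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64.tar.gz -L "<DOWNLOAD_URL_FROM_RUNNER_SETUP_PAGE>"
tar xzf actions-runner-linux-x64.tar.gz

# Register the runner against the repository and run it as a service
./config.sh --url https://github.com/our-org/our-repo --token "<REGISTRATION_TOKEN>"
sudo ./svc.sh install
sudo ./svc.sh start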

Problem #4: Empty Backups

After fixing the network and authentication issues, our backup workflow ran successfully. However, when we examined the backup files in S3, we found some were empty or much smaller than expected. The issue? mongodump was occasionally being cut short on large collections, and because the workflow never checked its exit status, it went ahead and compressed whatever partial output was on disk.

We modified our workflow to add more diagnostics and to ensure the mongodump completed successfully:

- name: Run MongoDB backup
  env:
    MONGO_URI: ${{ secrets.MONGO_URI }}
  run: |
    DATE=$(date +%Y-%m-%d)
    echo "Starting MongoDB dump at $(date)"
    mkdir -p "./backup/$DATE"
    
    # Add verbosity for better diagnostics
    mongodump --uri="$MONGO_URI" --out="./backup/$DATE" --verbose
    
    # Check if dump completed successfully
    if [ $? -ne 0 ]; then
      echo "MongoDB dump failed"
      exit 1
    fi
    
    echo "MongoDB dump completed at $(date)"
    echo "Backup size: $(du -sh ./backup/$DATE)"
    
    echo "Compressing backup..."
    tar -czf "$DATE.tar.gz" -C "./backup" "$DATE"
    
    echo "Compression completed. Archive size: $(du -sh $DATE.tar.gz)"

This change gave us better visibility into the backup process and helped us identify that some larger collections needed more time.
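
If dump time itself becomes the bottleneck, mongodump has a couple of standard flags worth knowing about (shown here for reference; they are not part of the workflow above):

# --gzip compresses each BSON file as it is written (restores then need
# mongorestore --gzip), and --numParallelCollections controls how many
# collections are dumped concurrently (the default is 4)
mongodump --uri="$MONGO_URI" --out="./backup/$DATE" \
  --gzip \
  --numParallelCollections=4 \
  --verbose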

Problem #5: S3 Permissions

Our next challenge was with S3 permissions. The workflow could upload backups, but we encountered this error when trying to list existing backups:

An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

The IAM policy for our AWS access key was too restrictive. We updated it to include the necessary permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-backups",
        "arn:aws:s3:::my-backups/*"
      ]
    }
  ]
}

Setting Up Backup Rotation

With the basic backup process working reliably, we needed to implement a backup rotation policy to manage storage costs and ensure we had a good retention strategy. We decided on:

  • Daily backups for the past 7 days
  • Weekly backups for the past 4 weeks
  • Monthly backups for the past 12 months
  • Yearly backups indefinitely

We implemented this by extending our workflow:

- name: Implement backup rotation
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_REGION: 'us-east-1'
  run: |
    DATE=$(date +%Y-%m-%d)
    DOW=$(date +%u)  # Day of week (1-7)
    DOM=$(date +%d)  # Day of month (01-31)
    
    # Create a weekly backup on Sundays
    if [ "$DOW" = "7" ]; then
      WEEK=$(date +%G-W%V)  # %G is the ISO week-numbering year, which matches %V at year boundaries
      aws s3 cp "$DATE.tar.gz" "s3://my-backups/mongodb/weekly/$WEEK.tar.gz"
      echo "Created weekly backup: $WEEK.tar.gz"
    fi
    
    # Create a monthly backup on the 1st of the month
    if [ "$DOM" = "01" ]; then
      MONTH=$(date +%Y-%m)
      aws s3 cp "$DATE.tar.gz" "s3://my-backups/mongodb/monthly/$MONTH.tar.gz"
      echo "Created monthly backup: $MONTH.tar.gz"
    fi
    
    # Create a yearly backup on January 1st
    if [ "$DATE" = "$(date +%Y)-01-01" ]; then
      YEAR=$(date +%Y)
      aws s3 cp "$DATE.tar.gz" "s3://my-backups/mongodb/yearly/$YEAR.tar.gz"
      echo "Created yearly backup: $YEAR.tar.gz"
    fi
    
    # Delete daily backups older than 7 days
    aws s3 ls "s3://my-backups/mongodb/" | grep -v "/" | awk '{print $4}' | while read backup; do
      backup_date=$(echo $backup | sed 's/.tar.gz//')
      days_old=$(( ( $(date +%s) - $(date -d "$backup_date" +%s) ) / 86400 ))
      if [ $days_old -gt 7 ]; then
        aws s3 rm "s3://my-backups/mongodb/$backup"
        echo "Deleted old daily backup: $backup"
      fi
    done
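
Note that the loop above only prunes the daily backups; as written, the weekly and monthly prefixes grow without bound. A sketch of the same pattern applied to the weekly prefix, using the LastModified column that aws s3 ls already prints (monthly pruning would follow the same shape with a larger threshold):

# Prune weekly backups older than ~5 weeks; $modified is the object's
# LastModified date from the listing, $key is the file name
aws s3 ls "s3://my-backups/mongodb/weekly/" | awk '{print $1, $4}' | while read modified key; do
  [ -z "$key" ] && continue
  age_days=$(( ( $(date +%s) - $(date -d "$modified" +%s) ) / 86400 ))
  if [ "$age_days" -gt 35 ]; then
    aws s3 rm "s3://my-backups/mongodb/weekly/$key"
    echo "Deleted old weekly backup: $key"
  fi
done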

Implementing Backup Verification

A backup is only as good as its ability to be restored. We added a verification step to ensure our backups were usable:

- name: Verify backup integrity
  run: |
    DATE=$(date +%Y-%m-%d)
    
    # Create a temporary directory for verification
    mkdir -p ./verify
    
    # Extract backup to verify its contents
    tar -xzf "$DATE.tar.gz" -C ./verify
    
    # Check if extraction was successful
    if [ $? -ne 0 ]; then
      echo "Backup verification failed: Could not extract archive"
      exit 1
    fi
    
    # Count collections to ensure we have data
    COLLECTIONS=$(find ./verify/$DATE -name "*.bson" | wc -l)
    echo "Backup contains $COLLECTIONS collections"
    
    if [ $COLLECTIONS -eq 0 ]; then
      echo "Backup verification failed: No collections found"
      exit 1
    fi
    
    # Show the five largest collections by BSON file size
    find ./verify/$DATE -name "*.bson" -exec du -h {} + | sort -rh | head -5
    
    echo "Backup verification completed successfully"

Adding Restore Functionality

With backups working reliably, we created a separate workflow for restoring data when needed. This workflow would be manually triggered with parameters to specify which backup to restore:

# .github/workflows/mongodb-restore.yml
name: MongoDB Restore

on:
  workflow_dispatch:
    inputs:
      backup_date:
        description: 'Backup date to restore (YYYY-MM-DD)'
        required: true
      target_database:
        description: 'Target database name'
        required: true
      collections:
        description: 'Specific collections to restore (comma-separated, leave empty for all)'
        required: false

jobs:
  restore:
    runs-on: self-hosted
    steps:
      - name: Install MongoDB tools
        run: |
          # Same MongoDB tools installation as in the backup workflow
          
      - name: Download backup from S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
        run: |
          aws s3 cp "s3://my-backups/mongodb/${{ github.event.inputs.backup_date }}.tar.gz" ./
          tar -xzf "${{ github.event.inputs.backup_date }}.tar.gz"
          
      - name: Restore database
        env:
          MONGO_URI: ${{ secrets.MONGO_URI }}
        run: |
          if [ -z "${{ github.event.inputs.collections }}" ]; then
            # Restore entire database
            mongorestore --uri="$MONGO_URI" --nsInclude="${{ github.event.inputs.target_database }}.*" --drop "./${{ github.event.inputs.backup_date }}"
          else
            # Restore specific collections
            IFS=',' read -ra COLLECTIONS <<< "${{ github.event.inputs.collections }}"
            for collection in "${COLLECTIONS[@]}"; do
              echo "Restoring collection: $collection"
              mongorestore --uri="$MONGO_URI" --nsInclude="${{ github.event.inputs.target_database }}.$collection" --drop "./${{ github.event.inputs.backup_date }}"
            done
          fi
          
      - name: Cleanup
        run: |
          rm -rf "${{ github.event.inputs.backup_date }}" "${{ github.event.inputs.backup_date }}.tar.gz"

This restore workflow allowed us to quickly recover from data issues by selecting a backup date and optionally specifying which collections to restore.
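
Because the workflow is driven by workflow_dispatch inputs, it can be triggered from the Actions tab or from the gh CLI. For example (the date, database, and collection names below are only examples):

# Restore two collections from a specific backup into the target database
gh workflow run mongodb-restore.yml \
  -f backup_date=2024-01-15 \
  -f target_database=mydb \
  -f collections=users,orders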

Improved Security with OIDC

As a final security improvement, we replaced static AWS credentials with OpenID Connect (OIDC) to obtain temporary credentials. This eliminated the need to store long-lived AWS access keys in GitHub Secrets.

First, we set up the AWS IAM Identity Provider and role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:our-org/our-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}

Then we updated our workflow to use OIDC instead of static credentials:

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-actions-mongodb-backup
    aws-region: us-east-1
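
One detail that is easy to miss: the job must be allowed to request an OIDC token in the first place, which means granting the id-token permission in the workflow. Without it, the configure-aws-credentials step cannot obtain a token:

permissions:
  id-token: write   # allows the job to request an OIDC token from GitHub
  contents: read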

This change eliminated a significant security risk by removing long-lived credentials from our GitHub Secrets.

Adding Monitoring and Notifications

To complete our backup system, we added monitoring and Slack notifications so that every run, successful or not, would be visible to the team.

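The payload below references BACKUP_DATE and BACKUP_SIZE from the env context, so the backup step has to export them first. A minimal sketch of that export (the variable names are just a convention for this workflow, not something GitHub provides):

# At the end of the backup step: expose the date and archive size to later steps
echo "BACKUP_DATE=$DATE" >> "$GITHUB_ENV"
echo "BACKUP_SIZE=$(du -sh "$DATE.tar.gz" | cut -f1)" >> "$GITHUB_ENV"

With those values exported, the notification step looks like this:
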
- name: Send notification
  if: always()  # Run even if previous steps failed
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": "MongoDB Backup ${{ job.status }}",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*MongoDB Backup ${{ job.status }}*\nRepository: ${{ github.repository }}\nWorkflow: MongoDB Backup\nDate: $(date +%Y-%m-%d)"
            }
          },
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "Backup Size: ${{ env.BACKUP_SIZE }}\nBackup Location: s3://my-backups/mongodb/$(date +%Y-%m-%d).tar.gz"
            }
          },
          {
            "type": "actions",
            "elements": [
              {
                "type": "button",
                "text": {
                  "type": "plain_text",
                  "text": "View Workflow Run"
                },
                "url": "https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
              }
            ]
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
    BACKUP_SIZE: ${{ env.BACKUP_SIZE }}

This notification would be sent to Slack after each backup, whether successful or failed, allowing us to quickly identify and troubleshoot any issues.

Testing the Disaster Recovery Process

With our backup and restore workflows in place, we scheduled regular disaster recovery tests to ensure our team could quickly respond to a real emergency. Every quarter, we:

  1. Spin up a test MongoDB instance
  2. Restore a recent backup to the test instance
  3. Verify data integrity and test application functionality
  4. Practice the entire recovery procedure with an engineer who hadn't performed it before
  5. Document any issues or improvements

These tests proved invaluable when we did face a real data loss situation six months later. Our team was able to restore from backups with confidence and minimal downtime.

Lessons Learned

Implementing this MongoDB backup solution with GitHub Actions taught us several important lessons:

  1. Credential management is critical: Proper escaping of special characters in connection strings and secure handling of credentials can prevent numerous headaches.
  2. Network security requires planning: Consider network access restrictions early in your design process, not as an afterthought.
  3. Verify backups proactively: An unverified backup is potentially useless. Always include verification steps.
  4. Create backup hierarchies: Having different retention policies for daily, weekly, monthly, and yearly backups provides flexibility for recovery options.
  5. Test restore procedures regularly: The real test of a backup system is restoration. Practice it regularly.
  6. Document everything: When disaster strikes, clear documentation makes recovery faster and less stressful.

Conclusion

Using GitHub Actions for MongoDB backups has been a game-changer for our team. We went from a manual, error-prone process to a fully automated, monitored, and tested backup system that has already proven its value in real-world recovery scenarios.

The combination of scheduled workflows, self-hosted runners, secure credential management, and comprehensive verification provides peace of mind that our data is protected. The added benefits of version-controlled backup configurations and integration with our existing CI/CD system make GitHub Actions an excellent choice for database backups.

If your team is already using GitHub, I highly recommend leveraging GitHub Actions for critical operational tasks like database backups. The investment in setting up proper automation will pay dividends the first time you need to recover from a data issue—and you will eventually need to recover; it's just a matter of when.