Managing Logs with MongoDB TTL Indexes
When our application's logging system started causing disk space alerts at 3 AM, I knew we had to find a better way to manage our MongoDB logs. The journey to implement Time-To-Live (TTL) indexes taught me several valuable lessons about MongoDB's data management capabilities and the nuances of automated data cleanup.
The Problem: Overwhelming Log Growth
Our application had been happily logging events, errors, and audit trails to MongoDB for months. The setup was simple - each log entry went into a logs collection with timestamps, levels, and message data. But we hadn't implemented any cleanup strategy, assuming we'd deal with it "later."
Later arrived in the form of a 3 AM phone call when our monitoring system alerted that one of our MongoDB servers was running dangerously low on disk space. The logs collection had grown to over 50GB, containing months of data that we didn't actually need to keep.
Our immediate response was a manual cleanup:
// Delete logs older than 30 days
db.logs.deleteMany({
timestamp: { $lt: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
})

But this manual approach had two significant problems:
- It required manual intervention (not ideal at 3 AM)
- Bulk deletion operations on large collections can impact database performance (batching the deletes, as sketched below, softens the impact but is still manual)
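A common mitigation is to delete in bounded batches so no single pass monopolizes the database; a generic mongosh sketch, assuming the same 30-day cutoff:
// Batched cleanup: delete old logs at most 1000 at a time
const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
let deletedCount;
do {
  const ids = db.logs
    .find({ timestamp: { $lt: cutoff } }, { _id: 1 })
    .limit(1000)
    .toArray()
    .map(doc => doc._id);
  deletedCount = ids.length > 0
    ? db.logs.deleteMany({ _id: { $in: ids } }).deletedCount
    : 0;
} while (deletedCount > 0);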
We needed an automated, low-impact solution.
Enter TTL Indexes
MongoDB's Time-To-Live (TTL) indexes seemed like the perfect solution. They automatically remove documents after a specified period has passed. I read the documentation and implemented what seemed like a straightforward index:
// Create a TTL index on the timestamp field to expire documents after 30 days
db.logs.createIndex(
{ timestamp: 1 },
{ expireAfterSeconds: 2592000 } // 30 days in seconds
)

I verified the index was created properly:
db.logs.getIndexes()
[
{
"v" : 2,
"key" : { "_id" : 1 },
"name" : "_id_",
"ns" : "appdb.logs"
},
{
"v" : 2,
"key" : { "timestamp" : 1 },
"name" : "timestamp_1",
"ns" : "appdb.logs",
"expireAfterSeconds" : 2592000
}
]

Feeling confident, I went back to sleep, expecting MongoDB to clean up old logs automatically.
Problem #1: TTL Indexes Don't Work Instantly
The next morning, I checked the logs collection size and was surprised to find it hadn't decreased. Looking deeper into MongoDB's documentation, I discovered something important that I had missed:
"The background task that removes expired documents runs every 60 seconds. As a result, documents may remain in a collection during the period between the expiration of the document and the running of the background task."
Moreover, TTL deletions happen slowly to avoid impacting database performance. They don't delete all expired documents at once but instead work gradually. For a 50GB collection, this could take quite some time.
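There is also a cheap way to confirm the TTL monitor is at least doing something: the server exposes cumulative TTL counters (a quick mongosh check; the counters reset on server restart):
// Inspect the TTL monitor's cumulative activity since server start
const ttlStats = db.serverStatus().metrics.ttl;
print(`TTL passes: ${ttlStats.passes}`);
print(`Documents deleted by TTL: ${ttlStats.deletedDocuments}`);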
I decided to be patient and check again later.
Problem #2: String Dates Don't Work with TTL Indexes
After waiting several hours and seeing minimal changes in collection size, I investigated further. Looking at our application code, I noticed a critical issue with how we were storing timestamps:
// Our logging code
const logEvent = (level, message, metadata) => {
const logEntry = {
level,
message,
metadata,
timestamp: new Date().toISOString(), // Storing as a string!
service: 'api-service'
};
db.collection('logs').insertOne(logEntry);
};

We were storing timestamps as ISO string representations rather than actual Date objects! TTL indexes only expire documents whose indexed field holds a BSON Date (or an array of Dates); anything else, including perfectly valid ISO 8601 strings, is simply never expired.
Looking at a sample document confirmed the issue:
{
"_id" : ObjectId("5f8a7b9c3e24c2001f7a9b23"),
"level" : "info",
"message" : "User logged in",
"metadata" : { "userId" : "user123", "ip" : "192.168.1.1" },
"timestamp" : "2023-11-10T14:32:25.123Z", // String, not a Date object!
"service" : "api-service"
}

Solution to Problem #2: Migrating to Date Objects
We needed to update our application code to store proper Date objects and migrate existing data. Here's how we changed our logging code:
// Updated logging code
const logEvent = (level, message, metadata) => {
const logEntry = {
level,
message,
metadata,
timestamp: new Date(), // Now storing as a Date object
service: 'api-service'
};
db.collection('logs').insertOne(logEntry);
};

For existing data, we had to run a migration script:
// Migration script to convert string timestamps to Date objects
const cursor = db.logs.find({
timestamp: { $type: "string" }
});
let bulkOps = [];
let count = 0;
const batchSize = 1000;
cursor.forEach(doc => {
bulkOps.push({
updateOne: {
filter: { _id: doc._id },
update: { $set: { timestamp: new Date(doc.timestamp) } }
}
});
count++;
if (count % batchSize === 0) {
db.logs.bulkWrite(bulkOps);
bulkOps = [];
print(`Processed ${count} documents`);
}
});
if (bulkOps.length > 0) {
db.logs.bulkWrite(bulkOps);
}
print(`Migration complete. Total: ${count} documents`);

This migration took several hours to complete but successfully converted our string timestamps to Date objects.
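A quick post-migration check confirmed nothing was left behind:
// Verify the migration: no string timestamps should remain
const leftover = db.logs.countDocuments({ timestamp: { $type: "string" } });
print(`Remaining string timestamps: ${leftover}`); // expect 0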
Problem #3: TTL Precision and Schedule
With proper Date objects now in place, I expected the TTL mechanism to start working promptly. But after waiting another day, I noticed that while some documents were being deleted, the process seemed slower than expected.
Further research revealed another important aspect of TTL indexes:
"The TTL monitor runs once per minute, but the actual rate of deletion is limited to avoid impacting database performance. Large collections may take several hours or even days to fully process."
Additionally, I discovered that the precision of TTL deletion is only to the minute, not to the second. This wasn't a problem for our logs, but it's an important consideration for applications that need more precise expiration.
Problem #4: Accidental Short TTL Value
While monitoring the deletion process, I decided to create a separate test collection to verify the TTL behavior. Unfortunately, I made a typo in the expireAfterSeconds value:
// Create a test collection with TTL index
db.logs_test.createIndex(
{ timestamp: 1 },
{ expireAfterSeconds: 2592 } // Intended 2592000 (30 days), but typed 2592 (43.2 minutes)!
)

I populated the test collection with some sample data and came back after lunch to find all the documents gone! This taught me to be extremely careful with TTL values and to always double-check them before implementation.
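One cheap guard against this class of mistake is to read the index back and fail fast if the value is wrong; a minimal sketch, using the timestamp_1 name from the earlier getIndexes() output:
// Fail fast if the TTL index's expiry isn't what we intended
const expectedTtl = 2592000; // 30 days
const ttlIndex = db.logs.getIndexes().find(i => i.name === "timestamp_1");
if (!ttlIndex || ttlIndex.expireAfterSeconds !== expectedTtl) {
  throw new Error(`TTL misconfigured: got ${ttlIndex ? ttlIndex.expireAfterSeconds : "no index"}`);
}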
Advanced TTL Strategy: Different Retention Policies
As our logging system matured, we realized different types of logs needed different retention policies:
- Debug logs: Keep for 3 days
- Info logs: Keep for 15 days
- Warning logs: Keep for 30 days
- Error logs: Keep for 60 days
- Audit logs: Keep for 365 days (for compliance)
However, TTL indexes can't have conditional expiration times based on document fields like "level". We had two options:
- Split into separate collections by log level (sketched below)
- Use a computed expiration field
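For completeness, the first option would have looked roughly like this (a sketch; the per-level collection names such as logs_debug are hypothetical):
// Option 1 sketch: one collection per level, each with its own classic TTL index
db.logs_debug.createIndex({ timestamp: 1 }, { expireAfterSeconds: 3 * 86400 });
db.logs_info.createIndex({ timestamp: 1 }, { expireAfterSeconds: 15 * 86400 });
db.logs_error.createIndex({ timestamp: 1 }, { expireAfterSeconds: 60 * 86400 });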
We chose the second approach, creating a new field called expiresAt with a pre-calculated expiration date:
// Updated logging code with calculated expiration
const logEvent = (level, message, metadata) => {
const now = new Date();
// Calculate expiration date based on level
let retentionDays;
switch (level) {
case 'debug':
retentionDays = 3;
break;
case 'info':
retentionDays = 15;
break;
case 'warning':
retentionDays = 30;
break;
case 'error':
retentionDays = 60;
break;
case 'audit':
retentionDays = 365;
break;
default:
retentionDays = 30; // Default retention
}
const expiresAt = new Date(now.getTime() + retentionDays * 24 * 60 * 60 * 1000);
const logEntry = {
level,
message,
metadata,
timestamp: now,
expiresAt: expiresAt,
service: 'api-service'
};
db.collection('logs').insertOne(logEntry);
};

Then we created a TTL index on the expiresAt field:
// Create TTL index on expiresAt field
db.logs.createIndex(
{ expiresAt: 1 },
{ expireAfterSeconds: 0 } // Delete when current time > expiresAt
)

This approach gave us flexible retention policies while still using MongoDB's automatic TTL deletion mechanism.
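As an aside, recent MongoDB versions also allow a TTL index to be combined with a partialFilterExpression, scoping expiration to a subset of documents; a hedged sketch (check your server version's partial-index rules before relying on this):
// Partial TTL sketch: expire only debug logs after 3 days
db.logs.createIndex(
  { timestamp: 1 },
  {
    expireAfterSeconds: 3 * 24 * 60 * 60,
    partialFilterExpression: { level: "debug" }
  }
);
Stacking several of these on the same key pattern is version-dependent, which is another argument for the computed expiresAt field.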
Monitoring TTL Deletion Activity
To ensure our TTL indexes were working properly, we implemented monitoring to track collection size and document counts by age:
// Monitor logs by age groups
const now = new Date();
const dayInMs = 24 * 60 * 60 * 1000;
const counts = [
{
range: "0-7 days",
count: db.logs.countDocuments({
timestamp: { $gte: new Date(now - 7 * dayInMs) }
})
},
{
range: "8-30 days",
count: db.logs.countDocuments({
timestamp: {
$lt: new Date(now - 7 * dayInMs),
$gte: new Date(now - 30 * dayInMs)
}
})
},
{
range: "31-60 days",
count: db.logs.countDocuments({
timestamp: {
$lt: new Date(now - 30 * dayInMs),
$gte: new Date(now - 60 * dayInMs)
}
})
},
{
range: "> 60 days",
count: db.logs.countDocuments({
timestamp: { $lt: new Date(now - 60 * dayInMs) }
})
}
];
// Size in MB
const collStats = db.logs.stats();
const sizeInMB = collStats.size / (1024 * 1024);
print(`Logs collection size: ${sizeInMB.toFixed(2)} MB`);
print("Documents by age:");
counts.forEach(c => print(`- ${c.range}: ${c.count} documents`));

We scheduled this script to run daily and alert us if the oldest age group still had documents past their expected expiration date.
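The alert condition itself is one more count; a minimal sketch (audit logs are excluded, since their 365-day retention legitimately outlives every other bucket):
// Alert when non-audit documents outlive the longest non-audit retention (60 days)
const staleCount = db.logs.countDocuments({
  level: { $ne: "audit" },
  timestamp: { $lt: new Date(Date.now() - 60 * 24 * 60 * 60 * 1000) }
});
if (staleCount > 0) {
  print(`ALERT: ${staleCount} log documents are past their expected expiration`);
}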
Handling TTL Index Modifications
At one point, we needed to change our retention policy for info logs from 15 days to 10 days. Because expiration in our scheme lives in the pre-computed expiresAt values (the index stays at expireAfterSeconds: 0), the index itself didn't need to change at all. It's worth knowing, though, that even a classic TTL index doesn't have to be dropped and recreated to change its expiration: the collMod command can update expireAfterSeconds in place:
// For a classic TTL index (e.g. one on timestamp), change the expiry in place
db.runCommand({
  collMod: "logs",
  index: {
    keyPattern: { timestamp: 1 },
    expireAfterSeconds: 864000 // 10 days
  }
});
For us, the real work was updating the expiresAt values of existing info logs. We wrote a focused update operation:
// Update expiration for existing info logs
const now = new Date();
const cutoffDate = new Date(now - 10 * 24 * 60 * 60 * 1000); // 10 days ago
db.logs.updateMany(
{
level: "info",
timestamp: { $lt: cutoffDate },
expiresAt: { $gt: now } // Only those not already expired
},
{
$set: { expiresAt: now } // Set to expire immediately
}
);

One caveat: this only expires info logs already past the new 10-day cutoff. Info logs written within the last 10 days still carry their original 15-day expiresAt; if the new policy must apply to them immediately, a similar updateMany that recomputes expiresAt from timestamp will re-stamp them.

Handling High-Volume Log Insertion
As our application grew, we started generating millions of log entries per day. We noticed that individual inserts were becoming a performance bottleneck. We modified our logging approach to use bulk operations:
// Batch logging implementation
class BatchLogger {
constructor(db, batchSize = 100, flushIntervalMs = 5000) {
this.db = db;
this.batchSize = batchSize;
this.flushIntervalMs = flushIntervalMs;
this.logBuffer = [];
// Set up periodic flush
this.flushInterval = setInterval(() => this.flush(), this.flushIntervalMs);
}
log(level, message, metadata) {
const now = new Date();
let retentionDays;
// ... retention logic same as before ...
const expiresAt = new Date(now.getTime() + retentionDays * 24 * 60 * 60 * 1000);
this.logBuffer.push({
level,
message,
metadata,
timestamp: now,
expiresAt: expiresAt,
service: 'api-service'
});
// If buffer reaches batchSize, flush immediately
if (this.logBuffer.length >= this.batchSize) {
this.flush();
}
}
async flush() {
if (this.logBuffer.length === 0) return;
const logs = [...this.logBuffer];
this.logBuffer = [];
try {
await this.db.collection('logs').insertMany(logs);
} catch (err) {
console.error('Failed to insert logs:', err);
// In case of error, add critical logs back to buffer
const criticalLogs = logs.filter(log =>
log.level === 'error' || log.level === 'audit'
);
if (criticalLogs.length > 0) {
this.logBuffer.unshift(...criticalLogs);
}
}
}
shutdown() {
clearInterval(this.flushInterval);
return this.flush(); // Final flush
}
}

This batch approach significantly reduced the database write load while maintaining logging reliability.
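A usage sketch (assuming db is an already-connected handle from the Node.js driver):
// Share one logger instance, and flush it on shutdown
const logger = new BatchLogger(db);
logger.log('info', 'User logged in', { userId: 'user123' });

process.on('SIGTERM', async () => {
  await logger.shutdown(); // clears the timer and does a final flush
  process.exit(0);
});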
Lessons Learned
Implementing TTL indexes for log management taught us several important lessons:
- TTL indexes require actual Date objects: String representations won't work, even if they're valid ISO date strings.
- TTL deletion is not immediate: The TTL monitor runs every minute, but deletions are throttled to prevent performance impact.
- TTL precision is to the minute: Don't rely on second-level precision for document expiration.
- Double-check expiration values: A simple typo can lead to data being deleted much sooner than intended.
- Computed expiration fields provide flexibility: Using a separate expiresAt field allows for different retention policies.
- Monitoring is essential: Regular checks ensure that the TTL mechanism is working as expected.
Final Results
After implementing proper TTL indexes with our improved logging system:
- Our logs collection stabilized at around 8GB (down from the peak of 50GB)
- Disk space alerts became a thing of the past
- We had properly tiered retention policies for different log types
- The system scaled smoothly even as our log volume increased
- Our database performance improved with the reduced storage overhead
Most importantly, I haven't received a 3 AM phone call about database disk space since we implemented this solution!
Conclusion
MongoDB's TTL indexes are a powerful tool for automated data lifecycle management, but they come with nuances that aren't immediately obvious from the documentation. By understanding the behavior of TTL indexes and implementing proper monitoring, you can create a robust, low-maintenance solution for managing time-sensitive data like logs.
For any application that generates significant amounts of temporary or aging data, TTL indexes should be one of the first MongoDB features you explore. They provide an elegant, database-level solution for data expiration that doesn't require external cleanup scripts or cron jobs.