Part 3 - Memory Management and Performance Optimization

PART 3: MEMORY MANAGEMENT AND PERFORMANCE OPTIMIZATION

Understanding Zeek’s Memory Usage

Zeek is a memory-intensive application because it maintains rich state about network connections and protocol sessions. Understanding how Zeek uses memory will help you size your systems appropriately and write scripts that don’t cause memory problems.

Major memory consumers:

┌──────────────────────────────────────────────────────────────┐
│                ZEEK MEMORY USAGE BREAKDOWN                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  CONNECTION STATE (40-60% of memory)                         │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                         │
│  • Connection records for all active connections             │
│  • Protocol-specific state (HTTP transactions, DNS queries)  │
│  • Packet reassembly buffers                                 │
│                                                              │
│  SCRIPT STATE (20-40% of memory)                             │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                         │
│  • Tables and sets maintained by scripts                     │
│  • Intel framework indicator database                        │
│  • Statistical tracking structures                           │
│  • Custom data structures in your scripts                    │
│                                                              │
│  PACKET BUFFERS (10-20% of memory)                           │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                         │
│  • Buffers for incoming packets                              │
│  • Reassembly buffers for fragmented traffic                 │
│                                                              │
│  ZEEK CORE (5-10% of memory)                                 │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                         │
│  • Zeek executable and libraries                             │
│  • Event engine data structures                              │
│  • Protocol analyzers                                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Memory usage estimation:

Here’s a rough formula for estimating Zeek’s memory requirements:

Base Memory = 2 GB (Zeek core + base scripts)

Connection Memory = (Active Connections) × (Memory per Connection)
                  = (Active Connections) × ~10 KB

Script Memory = Variable, depends on your detection scripts
              = 1-4 GB for typical deployments

Total = Base + Connection Memory + Script Memory

Example calculations:

Network Type	Concurrent Connections	Estimated Memory
Small office (100 Mbps)	~5,000	2 + (5K × 10KB) = ~2.5 GB
Medium enterprise (1 Gbps)	~50,000	2 + (50K × 10KB) = ~3 GB
Large enterprise (10 Gbps)	~500,000	2 + (500K × 10KB) = ~7 GB
Data center (100 Gbps)	~5,000,000	2 + (5M × 10KB) = ~50 GB

Add 50-100% headroom to these estimates for safety and to account for script memory usage.

Performance Tuning: Getting the Most from Your Hardware

Zeek’s performance depends on several factors: hardware capabilities, traffic characteristics, configuration choices, and script complexity. Let’s explore the key performance considerations and how to optimize them.

CPU Considerations

Zeek is CPU-intensive, particularly for protocol parsing and script execution. CPU speed and architecture significantly impact performance.

CPU requirements scale with:

Traffic volume (more packets = more processing)
Traffic complexity (application protocols require more parsing than simple TCP)
Script complexity (sophisticated detection logic uses more CPU)
Enabled features (file extraction, statistical analysis add overhead)

CPU optimization strategies:

Strategy	Impact	When to Use
Higher clock speed	Significant	Single-worker deployments
More cores	Significant	Cluster deployments
Modern CPU architecture	Moderate	New deployments
Disable unused analyzers	Moderate	Specialized monitoring
Optimize scripts	Variable	Always

Memory Performance

Memory speed and capacity affect Zeek’s ability to maintain state for large numbers of connections.

Memory optimization:

Use fast RAM (DDR4-3200 or better)
Ensure sufficient capacity to avoid swapping (swapping kills performance)
Consider memory channels (more channels = better bandwidth)

Connection timeout tuning:

One of the most effective ways to control memory usage is adjusting connection timeouts. Zeek expires connection state for inactive connections based on configured timeouts:

# Default timeouts (in /usr/local/zeek/share/zeek/base/init-bare.zeek)
redef tcp_inactivity_timeout = 5 min;  # TCP connections
redef udp_inactivity_timeout = 1 min;  # UDP "connections"
redef icmp_inactivity_timeout = 1 min; # ICMP

Tuning considerations:

┌──────────────────────────────────────────────────────────────┐
│            CONNECTION TIMEOUT TRADE-OFFS                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  SHORTER TIMEOUTS (e.g., 1-2 minutes)                        │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━    │
│  ✓ Lower memory usage                                        │
│  ✓ Faster state cleanup                                      │
│  ✗ May prematurely expire legitimate long connections        │
│  ✗ Could miss attacks that span long periods                 │
│                                                              │
│  LONGER TIMEOUTS (e.g., 10-15 minutes)                       │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━    │
│  ✓ Complete tracking of long-lived connections               │
│  ✓ Better detection of persistent threats                    │
│  ✗ Higher memory usage                                       │
│  ✗ Slower to reclaim memory from idle connections            │
│                                                              │
└──────────────────────────────────────────────────────────────┘

For high-volume networks where memory is constrained, shorter timeouts help. For threat hunting where you need to track persistent connections, longer timeouts are better.

Disk I/O Performance

Logging generates significant disk I/O. On a busy network, Zeek might write several gigabytes of logs per hour.

Storage recommendations:

Storage Type	Performance	Use Case
NVMe SSD	Excellent	High-volume logging, logger nodes
SATA SSD	Good	Medium-volume logging
HDD RAID	Adequate	Archival storage, budget deployments
Network storage	Variable	Centralized log storage (watch latency)

I/O optimization strategies:

Use fast local storage for active logs
Rotate logs frequently to prevent huge files
Compress and move old logs to slower archival storage
Consider separate file systems for logs vs OS

Network Performance

For cluster deployments, network connectivity between nodes affects performance.

Network requirements:

┌──────────────────────────────────────────────────────────────┐
│             CLUSTER NETWORK REQUIREMENTS                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  MANAGEMENT NETWORK (between cluster nodes)                  │
│  • Minimum: 1 Gbps                                           │
│  • Recommended: 10 Gbps for large clusters                   │
│  • Low latency critical for coordination                     │
│  • Separate from monitored network                           │
│                                                              │
│  MONITORING NETWORK (where packets are captured)             │
│  • Must match or exceed monitored network speed              │
│  • Often uses SPAN/TAP ports (receive-only)                  │
│  • No IP configuration needed (promiscuous mode)             │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Monitoring Zeek’s Performance

To ensure Zeek is performing well, you need to monitor its health and performance metrics.

Key metrics to track:

Metric	What It Indicates	Warning Threshold
Packet drop rate	Can’t keep up with traffic	>1%
Memory usage	Approaching capacity	>80%
CPU usage	Processing load	>80% sustained
Disk I/O wait	Storage bottleneck	>10%
Event queue depth	Script processing lag	>1000 events

Zeek provides performance statistics:

# Check for dropped packets
zeek-cut ts dropped captured < capture_loss.log

# Monitor Zeek's internal stats
zeek-cut ts mem event_queue_size < stats.log

Common performance problems and solutions:

┌──────────────────────────────────────────────────────────────┐
│           TROUBLESHOOTING PERFORMANCE ISSUES                 │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  SYMPTOM: Dropping packets                                   │
│  CAUSES: CPU overload, memory exhaustion, network bottleneck │
│  SOLUTIONS:                                                  │
│    • Add more workers to cluster                             │
│    • Optimize or disable expensive scripts                   │
│    • Increase memory                                         │
│    • Improve packet acquisition (AF_PACKET, PF_RING)         │
│                                                              │
│  SYMPTOM: High memory usage                                  │
│  CAUSES: Too many concurrent connections, memory leaks       │
│  SOLUTIONS:                                                  │
│    • Reduce connection timeouts                              │
│    • Check scripts for unbounded table growth                │
│    • Add more memory                                         │
│    • Implement table expiration policies                     │
│                                                              │
│  SYMPTOM: Event queue growing                                │
│  CAUSES: Slow event handlers blocking processing             │
│  SOLUTIONS:                                                  │
│    • Profile scripts to find slow handlers                   │
│    • Optimize expensive operations                           │
│    • Remove or disable unused scripts                        │
│    • Reduce analysis granularity                             │
│                                                              │
│  SYMPTOM: High disk I/O wait                                 │
│  CAUSES: Slow storage, excessive logging                     │
│  SOLUTIONS:                                                  │
│    • Upgrade to faster storage (SSD/NVMe)                    │
│    • Reduce logging verbosity                                │
│    • Enable log compression                                  │
│    • Use separate file system for logs                       │
│                                                              │
└──────────────────────────────────────────────────────────────┘