# Troubleshooting Guide

Quick reference for diagnosing and fixing common issues in RhythmX production environments.

## Service Health Check (First Step)

Run this to see the status of all RhythmX services:

```shell
systemctl status api_gateway nginx mysqld logstash sigma_sql_backend \
  risk_scoring isolation_forest sigma-case-correlation actor_cache_sync \
  logrhythm-sync sigma-syslog-sender --no-pager
```

Check all timers:

```shell
systemctl list-timers | grep -i sigma
```
## Service Won't Start / Crashed

### API Gateway (api_gateway.service)

Symptoms: UI not loading, all API calls fail, 502 Bad Gateway

Diagnose:

```shell
systemctl status api_gateway
journalctl -u api_gateway -n 50 --no-pager
```
Common causes and fixes:

| Cause | Log message | Fix |
|---|---|---|
| MySQL down | Can't connect to MySQL | systemctl start mysqld |
| Port in use | Address already in use | lsof -i :5000, then kill the stale process |
| Permission denied on .env | Permission denied: .env.production | setfacl -m u:sigma_api:rw /opt/Sigma_ML_MSSP/.env.production |
| Python package missing | ModuleNotFoundError | pip3 install <missing_package> |
| Bad .env syntax | ValueError on startup | Check .env.production for syntax errors |
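The bad .env syntax cause is the easiest to pre-screen: scan the file for lines that are neither comments, blanks, nor KEY=VALUE pairs. A minimal sketch, demonstrated on a throwaway sample file (the regex assumes simple KEY=VALUE lines, which is how the file is used throughout this guide):

```shell
# Flag lines that are not comments, blank lines, or KEY=VALUE pairs.
check_env() {
  grep -nvE '^\s*(#|$|[A-Za-z_][A-Za-z0-9_]*=)' "$1"
}

# Demo on an inline sample; in practice point it at .env.production:
cat > /tmp/sample.env <<'EOF'
DB_PASSWORD=secret
# a comment
BROKEN LINE
JWT_SECRET=abc
EOF
check_env /tmp/sample.env   # prints 3:BROKEN LINE
```

A non-zero exit with no output means every line parses; any printed line is a candidate for the ValueError seen at startup.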
Restart:

```shell
systemctl restart api_gateway
# Verify:
curl -sk -o /dev/null -w "%{http_code}" https://localhost/
# Should return 200
```
### Nginx

Symptoms: Can't reach the site at all, connection refused on 443

Diagnose:

```shell
systemctl status nginx
nginx -t   # Test config syntax
tail -20 /var/log/nginx/error.log
```
Common causes and fixes:
| Cause | Fix |
|---|---|
| Config syntax error | nginx -t shows the error — fix the config file |
| SSL cert missing/expired | Check /etc/nginx/ssl/ — regenerate self-signed cert |
| Port 443 already in use | lsof -i :443 — kill conflicting process |
| SELinux blocking | ausearch -m avc -ts recent — add policy or set permissive |
Restart:

```shell
nginx -t && systemctl restart nginx
```
### MySQL (mysqld.service)

Symptoms: Everything fails — API, detections, incidents

Diagnose:

```shell
systemctl status mysqld
journalctl -u mysqld -n 50 --no-pager
mysql -u root -p"$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2)" -e "SELECT 1;"
```
Common causes and fixes:

| Cause | Log message | Fix |
|---|---|---|
| Disk full | No space left on device | Free disk space, check df -h |
| InnoDB corruption | InnoDB: corruption | Restore from backup |
| Too many connections | Too many connections | SET GLOBAL max_connections = 500; |
| Wrong password | Access denied | Check DB_PASSWORD in .env.production |
| Socket file missing | Can't connect through socket | systemctl restart mysqld |
Check database health (note: ROWS is a reserved word in MySQL 8, so the alias is row_count):

```shell
DB_PASS=$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2)
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT 'sigma_alerts' AS tbl, COUNT(*) AS row_count FROM sigma_alerts
UNION SELECT 'threat_cases', COUNT(*) FROM threat_cases
UNION SELECT 'incidents', COUNT(*) FROM incidents
UNION SELECT 'alarm_actors', COUNT(*) FROM alarm_actors;"
```
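One caveat on the password extraction used throughout this guide: cut -d= -f2 silently truncates a value that itself contains an = sign. A safer variant, sketched on a throwaway file:

```shell
# cut -d= -f2 keeps only the text between the first and second '='.
# sed strips the key prefix and keeps the whole value instead.
cat > /tmp/demo.env <<'EOF'
DB_PASSWORD=p@ss=word
EOF

DB_PASS=$(sed -n 's/^DB_PASSWORD=//p' /tmp/demo.env)
echo "$DB_PASS"   # p@ss=word  (cut -d= -f2 would return only "p@ss")
```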
### Logstash (logstash.service)

Symptoms: No new logs being ingested, ports 5514/5515/5516 not responding

Diagnose:

```shell
systemctl status logstash
journalctl -u logstash -n 50 --no-pager
# Check if ports are listening:
ss -tlnp | grep -E "5514|5515|5516"
```
Common causes and fixes:
| Cause | Fix |
|---|---|
| Pipeline config error | Check /etc/logstash/conf.d/*.conf for syntax |
| JVM out of memory | Increase heap in /etc/logstash/jvm.options |
| Port conflict | Another process on 5514/5515/5516 — kill it |
| Persisted queue corrupt | Remove /var/lib/logstash/queue/ and restart |
Check pipeline stats:

```shell
curl -s http://localhost:9600/_node/stats/pipelines | python3 -m json.tool | head -30
```
## Detection Pipeline Issues

### No New sigma_alerts Appearing

Symptoms: Investigation page shows stale data, no new detections

Diagnose:

```shell
# Check when the last detection was:
DB_PASS=$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2)
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT MAX(system_time) AS last_detection FROM sigma_alerts;"
# Check if rotation is running:
systemctl status sigma-log-rotate.timer
journalctl -u sigma-log-rotate -n 20 --no-pager
# Check if log files are growing:
ls -la /var/log/logstash/processed_syslog.xml
```
Common causes:
| Cause | Fix |
|---|---|
| Logstash not receiving logs | Check source (LogRhythm/syslog) is sending to correct IP:port |
| Log rotation not triggering | systemctl restart sigma-log-rotate.timer |
| RhythmX Rules engine failing | Check /var/log/logstash/rotation_errors.log |
| Processed file empty | Source not sending — check network/firewall |
| Detection output dir full | Clean /var/log/logstash/detected_rhythmx/ |
### No New Threat Cases

Symptoms: Threat cases section empty, no correlated attack patterns

Diagnose:

```shell
systemctl status sigma-case-correlation
journalctl -u sigma-case-correlation -n 30 --no-pager
# Check if sigma_alerts exist (cases are built from these):
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT COUNT(*) FROM sigma_alerts WHERE system_time >= DATE_SUB(NOW(), INTERVAL 24 HOUR);"
```
Common causes:
| Cause | Fix |
|---|---|
| No sigma_alerts | Fix detection pipeline first (see above) |
| Correlation service not running | systemctl restart sigma-case-correlation |
| All alerts already in cases | Check case_alerts table — alerts can't be reused |
| Thresholds too high | Check use case YAML files in /opt/Sigma_ML_MSSP/cases/config/use_cases/ |
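If thresholds are the suspect, temporarily lowering one confirms the diagnosis faster than log-reading. The actual schema of the use-case YAML files is deployment-specific; the fragment below is purely hypothetical (every field name is invented for illustration, not taken from the product) and only shows the kind of knob to look for:

```yaml
# Hypothetical use-case config — field names are illustrative only.
use_case: brute_force_login
min_alert_count: 5      # lower this to test whether cases start forming
window_minutes: 60
```

Compare against the real keys in /opt/Sigma_ML_MSSP/cases/config/use_cases/ before changing anything, and restart sigma-case-correlation afterwards.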
## Incident Issues

### Incidents Not Being Created

Symptoms: Qualifying actors on the alarms page but no incidents in /incidents

Diagnose:

```shell
# Is the timer running?
systemctl status sigma-incident-sync.timer
systemctl list-timers | grep incident
# Check last run:
journalctl -u sigma-incident-sync -n 30 --no-pager
# Check actor_risk_summary for qualifying actors:
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT actor_key, risk_score, risk_level, threat_case_count, last_seen
FROM actor_risk_summary
WHERE risk_score >= 70 OR threat_case_count >= 1
ORDER BY risk_score DESC LIMIT 10;"
```
Common causes:
| Cause | Fix |
|---|---|
| Timer not running | systemctl start sigma-incident-sync.timer |
| actor_risk_summary empty | systemctl restart actor_cache_sync and wait 5 minutes |
| Risk score below threshold | Check INCIDENT_MIN_RISK_SCORE in .env.production (default: 70) |
| last_seen too old (Path B) | Activity must be within INCIDENT_RISK_FRESHNESS_HOURS (default: 24h) |
| DB connection error | Check MySQL is running and password is correct |
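The Path B freshness rule is just an hours-since-last_seen comparison, which you can reproduce in the shell when eyeballing a single actor. A minimal sketch using GNU date; the timestamps below are sample values, in practice substitute last_seen from the query above:

```shell
# Hours elapsed between a last_seen timestamp and a reference time
# (fixed here so the demo is reproducible).
hours_since() {
  local last="$1" now="$2"
  echo $(( ( $(date -d "$now" +%s) - $(date -d "$last" +%s) ) / 3600 ))
}

h=$(hours_since "2024-06-01 00:00:00" "2024-06-02 06:00:00")
echo "$h"   # 30 — older than the 24h default, so Path B would not fire
```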
### Duplicate Incidents

Symptoms: Same actor has multiple OPEN incidents

Diagnose:

```shell
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT group_key, COUNT(*) AS incident_count
FROM incidents WHERE status = 'OPEN'
GROUP BY group_key HAVING COUNT(*) > 1;"
```
Fix: This was a bug in the old timestamp-based ID system. The new system uses hash(actor_key + actor_type + entity + sequence). To clean up:

```shell
# Keep the most recent incident per actor, delete duplicates:
mysql -u root -p"$DB_PASS" sigma_db -e "
DELETE i FROM incidents i
INNER JOIN (
  SELECT group_key, MAX(id) AS keep_id
  FROM incidents WHERE status = 'OPEN'
  GROUP BY group_key
) keep ON i.group_key = keep.group_key
WHERE i.id != keep.keep_id AND i.status = 'OPEN';"
```
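Before running the DELETE, it can help to rehearse the keep-the-highest-id-per-group rule outside the database. The same logic on plain text, with sample group_key,id pairs standing in for OPEN incidents:

```shell
# Sample rows: group_key,id — keep only the highest id per group_key.
printf '%s\n' 'actorA,3' 'actorA,7' 'actorB,2' |
  sort -t, -k1,1 -k2,2nr |     # sort by group, then id descending
  awk -F, '!seen[$1]++'        # first row per group = the keeper
# Output:
# actorA,7
# actorB,2
```

Everything not printed by the keeper filter corresponds to a row the DELETE would remove.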
### Auto-Close Not Working

Symptoms: Old incidents staying OPEN even with no activity

Diagnose:

```shell
# Check if incidents should have been closed:
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT incident_id, group_key, status, last_alarm_time,
       TIMESTAMPDIFF(HOUR, last_alarm_time, NOW()) AS hours_since_last
FROM incidents
WHERE status IN ('OPEN', 'IN_PROGRESS')
  AND last_alarm_time < DATE_SUB(NOW(), INTERVAL 24 HOUR);"
# Check if the actor still has open threat cases (blocks auto-close):
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT i.group_key, COALESCE(a.threat_case_count, 0) AS cases
FROM incidents i
LEFT JOIN actor_risk_summary a ON i.group_key = a.actor_key
WHERE i.status = 'OPEN';"
```
Common causes:

| Cause | Fix |
|---|---|
| Timer not running | systemctl start sigma-incident-sync.timer |
| Actor has open threat cases | Auto-close is blocked (by design) — close cases first |
| INCIDENT_AUTO_CLOSE_HOURS too high | Check .env.production (default: 24) |
## Authentication Issues

### Can't Log In / JWT Expired

Symptoms: Login page shows an error, or the page keeps showing "Session expired"

Diagnose:

```shell
# Test login API directly:
curl -sk -X POST https://localhost/api/login \
  -H "Content-Type: application/json" \
  -d '{"username":"rhythmx","password":"YOUR_PASSWORD"}'
# Check JWT config:
grep JWT_SECRET /opt/Sigma_ML_MSSP/.env.production | head -1
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong password | Check DEFAULT_ADMIN_PASSWORD in .env.production |
| JWT secret changed | All existing tokens invalidated — users must re-login |
| API Gateway down | systemctl restart api_gateway |
| Browser cache | Hard refresh Ctrl+Shift+R or clear cookies |
### LDAP Sync Failing

Symptoms: AD users not showing in RhythmX, LDAP users can't log in

Diagnose:

```shell
systemctl status ldapy_sync.timer
journalctl -u ldapy_sync -n 30 --no-pager
# Test LDAP connection manually:
python3 -c "
from ldap3 import Server, Connection
s = Server('YOUR_LDAP_SERVER', port=389)
c = Connection(s, 'DOMAIN\\\\user', 'password')
print('Connected:', c.bind())
"
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong LDAP credentials | Check LDAP_USER and LDAP_PASSWORD in .env.production |
| LDAP server unreachable | Check network/firewall — telnet LDAP_SERVER 389 |
| Invalid Base DN | Verify LDAP_BASE_DN format (e.g., DC=company,DC=local) |
| Timer not running | systemctl start ldapy_sync.timer |
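The Base DN is worth a syntactic pre-check, since a missing comma is the usual typo. This sketch only validates the DC=...,DC=... shape shown in the table's example, not DNs containing OU or CN components:

```shell
# Accepts DC=company,DC=local style values; rejects obvious typos.
valid_base_dn() {
  echo "$1" | grep -qE '^DC=[^,=]+(,DC=[^,=]+)+$'
}

valid_base_dn 'DC=company,DC=local' && echo ok        # ok
valid_base_dn 'DC=company DC=local' || echo malformed # malformed
```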
### External API Key Not Working

Symptoms: External feed API returns 401

Diagnose:

```shell
# Test the key:
curl -sk -w "%{http_code}" https://localhost/api/v1/feed/health \
  -H "X-API-Key: YOUR_KEY"
# Check if the key exists and is enabled:
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT key_id, enabled, entity_name, expires_at FROM api_keys;"
```
Common causes:
| Cause | Fix |
|---|---|
| Key disabled | UPDATE api_keys SET enabled=1 WHERE key_id='xxx'; |
| Key expired | UPDATE api_keys SET expires_at=NULL WHERE key_id='xxx'; |
| Wrong key format | Must start with smk_ prefix |
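The smk_ prefix rule can be pre-checked locally before blaming the server. Only the prefix comes from this guide; requiring at least one character after it is an illustrative extra:

```shell
# Reject keys that don't carry the expected smk_ prefix.
looks_like_api_key() {
  case "$1" in
    smk_?*) return 0 ;;   # prefix plus at least one more character
    *)      return 1 ;;
  esac
}

looks_like_api_key 'smk_abc123' && echo ok        # ok
looks_like_api_key 'abc123' || echo 'bad prefix'  # bad prefix
```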
## Integration Issues

### Jira/ServiceNow Push Failing

Symptoms: Incidents created but no tickets appearing in Jira/SNOW

Diagnose:

```shell
# Check the retry queue:
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT delivery_type, status, attempt_count, last_error, created_at
FROM delivery_retry_queue
ORDER BY created_at DESC LIMIT 10;"
# Check the retry processor:
systemctl status sigma-retry-processor.timer
journalctl -u sigma-retry-processor -n 20 --no-pager
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong Jira/SNOW credentials | Check integration config in System Settings UI |
| Network/proxy issue | Test curl https://your-jira-instance.com from the server |
| Auto-push not enabled | Check AUTO_PUSH_MIN_SEVERITY in .env.production |
| Retry timer stopped | systemctl start sigma-retry-processor.timer |
### Syslog Sender Not Forwarding

Symptoms: External SIEM not receiving alerts from RhythmX

Diagnose:

```shell
systemctl status sigma-syslog-sender
curl -s http://localhost:8888/health | python3 -m json.tool
# Check config:
cat /opt/Sigma_ML_MSSP/siem_sender/config.json
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong target IP/port | Fix in config.json or System Settings UI |
| Target not reachable | telnet TARGET_IP 514 from the server |
| Service crashed | systemctl restart sigma-syslog-sender |
| No new alerts to send | Check sigma_alerts — pipeline may be stalled |
## Performance Issues

### High CPU / RAM

Diagnose:

```shell
# Top processes:
top -bn1 | head -15
# Per-service memory:
systemctl show api_gateway -p MemoryCurrent
ps aux --sort=-%mem | head -10
# Check for runaway processes:
ps aux | grep -E "python3|java|gunicorn" | grep -v grep
```
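When several workers of the same daemon each look small individually, aggregating memory per command name tells the real story. The awk aggregation below is demonstrated on captured sample lines; it assumes the usual Linux ps aux column layout (%MEM in field 4, command in field 11):

```shell
# Sum %MEM (field 4) per command (field 11) from ps aux-style output.
printf '%s\n' \
  'api  101 0.0 2.5 0 0 ? S 0:00 x gunicorn' \
  'api  102 0.0 3.0 0 0 ? S 0:00 x gunicorn' \
  'ls   200 0.0 9.1 0 0 ? S 0:00 x java' |
  awk '{mem[$11] += $4} END {for (c in mem) printf "%s %.1f\n", c, mem[c]}' |
  sort
# Output:
# gunicorn 5.5
# java 9.1
```

On a live box, replace the printf sample with ps aux | tail -n +2.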
Common causes:
| Cause | Fix |
|---|---|
| Logstash heap too large | Reduce JVM heap in /etc/logstash/jvm.options |
| MySQL InnoDB buffer | Reduce innodb_buffer_pool_size in /etc/my.cnf |
| Too many Gunicorn workers | Reduce workers in api_gateway.service |
### Slow API Responses

Diagnose:

```shell
# Test response times (quote the URL so the shell doesn't treat & as a job separator):
curl -sk -o /dev/null -w "Time: %{time_total}s\n" "https://localhost/api/alarms/incidents?timeframe=1d&group_by=actor"
# Check MySQL slow queries:
mysql -u root -p"$DB_PASS" -e "SHOW PROCESSLIST;" | grep -v Sleep
```
Common causes:
| Cause | Fix |
|---|---|
| Missing database indexes | Run python3 /opt/Sigma_ML_MSSP/Backend/initializer.py |
| Large sigma_alerts table | Check retention — consider archiving old data |
| Too many concurrent users | Increase Gunicorn workers |
### Disk Full

Diagnose:

```shell
df -h
du -sh /var/lib/mysql /var/log/logstash /var/lib/logstash /opt/Sigma_ML_MSSP 2>/dev/null | sort -h
```
Common causes:

| Location | What fills it | Fix |
|---|---|---|
| /var/lib/mysql | sigma_alerts growing | Archive old data, increase retention cleanup |
| /var/log/logstash | Processed log files | Check log rotation is running |
| /var/lib/logstash/queue | Logstash persistent queue | Pipeline backlog — check output targets |
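The df check can be scripted so near-full filesystems stand out on their own. The awk filter is shown on captured sample lines; on a live box you could feed it df -h --output=pcent,target (GNU coreutils):

```shell
# Print filesystems at or above 90% use from df-style "pcent target" lines.
printf '%s\n' ' 42% /' ' 93% /var/lib/mysql' ' 97% /var/log' |
  awk '{ p = $1; sub(/%/, "", p); if (p + 0 >= 90) print $2, "at", $1 }'
# Output:
# /var/lib/mysql at 93%
# /var/log at 97%
```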
## Quick Recovery Commands

```shell
# Restart everything (nuclear option):
systemctl restart mysqld
sleep 3
systemctl restart api_gateway logstash sigma_sql_backend risk_scoring \
  isolation_forest sigma-case-correlation actor_cache_sync \
  logrhythm-sync sigma-syslog-sender
systemctl restart nginx
# Check everything is up:
systemctl status api_gateway nginx mysqld logstash sigma_sql_backend \
  sigma-case-correlation actor_cache_sync --no-pager | grep -E "Active:|●"
```
## Log Locations
| Service | How to view logs |
|---|---|
| API Gateway | journalctl -u api_gateway -f |
| Nginx | /var/log/nginx/error.log and /var/log/nginx/access.log |
| MySQL | journalctl -u mysqld -f |
| Logstash | journalctl -u logstash -f |
| Detection pipeline | /var/log/logstash/rotation_errors.log |
| Correlation engine | journalctl -u sigma-case-correlation -f |
| Actor cache sync | journalctl -u actor_cache_sync -f |
| Incident sync | journalctl -u sigma-incident-sync -f |
| Syslog sender | journalctl -u sigma-syslog-sender -f |
| LDAP sync | journalctl -u ldapy_sync -f |