Troubleshooting Guide

Quick reference for diagnosing and fixing common issues in RhythmX production environments.


Service Health Check (First Step)

Run this to see the status of all RhythmX services:

systemctl status api_gateway nginx mysqld logstash sigma_sql_backend \
  risk_scoring isolation_forest sigma-case-correlation actor_cache_sync \
  logrhythm-sync sigma-syslog-sender --no-pager

Check all timers:

systemctl list-timers | grep -i sigma

Service Won't Start / Crashed

API Gateway (api_gateway.service)

Symptoms: UI not loading, all API calls fail, 502 Bad Gateway

Diagnose:

systemctl status api_gateway
journalctl -u api_gateway -n 50 --no-pager

Common causes and fixes:

| Cause | Log message | Fix |
| --- | --- | --- |
| MySQL down | Can't connect to MySQL | systemctl start mysqld |
| Port in use | Address already in use | lsof -i :5000 then kill the stale process |
| Permission denied on .env | Permission denied: .env.production | setfacl -m u:sigma_api:rw /opt/Sigma_ML_MSSP/.env.production |
| Python package missing | ModuleNotFoundError | pip3 install <missing_package> |
| Bad .env syntax | ValueError on startup | Check .env.production for syntax errors |

Restart:

systemctl restart api_gateway
# Verify:
curl -sk -o /dev/null -w "%{http_code}" https://localhost/
# Should return 200


Nginx

Symptoms: Can't reach the site at all, connection refused on 443

Diagnose:

systemctl status nginx
nginx -t  # Test config syntax
tail -20 /var/log/nginx/error.log

Common causes and fixes:

| Cause | Fix |
| --- | --- |
| Config syntax error | nginx -t shows the error; fix the config file |
| SSL cert missing/expired | Check /etc/nginx/ssl/; regenerate the self-signed cert |
| Port 443 already in use | lsof -i :443; kill the conflicting process |
| SELinux blocking | ausearch -m avc -ts recent; add a policy or set permissive |

Restart:

nginx -t && systemctl restart nginx


MySQL (mysqld.service)

Symptoms: Everything fails — API, detections, incidents

Diagnose:

systemctl status mysqld
journalctl -u mysqld -n 50 --no-pager
mysql -u root -p"$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2-)" -e "SELECT 1;"

Common causes and fixes:

| Cause | Log message | Fix |
| --- | --- | --- |
| Disk full | No space left on device | Free disk space; check df -h |
| InnoDB corruption | InnoDB: corruption | Restore from backup |
| Too many connections | Too many connections | SET GLOBAL max_connections = 500; |
| Wrong password | Access denied | Check DB_PASSWORD in .env.production |
| Socket file missing | Can't connect through socket | systemctl restart mysqld |

Check database health:

DB_PASS=$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2-)
mysql -u root -p"$DB_PASS" -e "
  SELECT 'sigma_alerts' as tbl, COUNT(*) as row_count FROM sigma_alerts
  UNION SELECT 'threat_cases', COUNT(*) FROM threat_cases
  UNION SELECT 'incidents', COUNT(*) FROM incidents
  UNION SELECT 'alarm_actors', COUNT(*) FROM alarm_actors;"


Logstash (logstash.service)

Symptoms: No new logs being ingested, ports 5514/5515/5516 not responding

Diagnose:

systemctl status logstash
journalctl -u logstash -n 50 --no-pager
# Check if ports are listening:
ss -tlnp | grep -E "5514|5515|5516"

Common causes and fixes:

| Cause | Fix |
| --- | --- |
| Pipeline config error | Check /etc/logstash/conf.d/*.conf for syntax errors |
| JVM out of memory | Increase the heap in /etc/logstash/jvm.options |
| Port conflict | Another process is on 5514/5515/5516; kill it |
| Persisted queue corrupt | Remove /var/lib/logstash/queue/ and restart |

Check pipeline stats:

curl -s http://localhost:9600/_node/stats/pipelines | python3 -m json.tool | head -30
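The stats payload is large; to spot a stuck pipeline quickly, you can reduce it to per-pipeline in/out event counts. A sketch in Python, assuming the standard /_node/stats/pipelines response shape (an events block with in/out counters per pipeline):

```python
def summarize_pipelines(stats):
    """Reduce Logstash /_node/stats/pipelines output to {name: (in, out)}.
    A pipeline whose 'in' keeps growing while 'out' stays flat is stuck
    in its filters or outputs."""
    return {
        name: (p["events"]["in"], p["events"]["out"])
        for name, p in stats.get("pipelines", {}).items()
    }

# Usage on the box (assumes the monitoring API is on port 9600):
#   import json, urllib.request
#   stats = json.load(urllib.request.urlopen("http://localhost:9600/_node/stats/pipelines"))
#   print(summarize_pipelines(stats))
```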


Detection Pipeline Issues

No New sigma_alerts Appearing

Symptoms: Investigation page shows stale data, no new detections

Diagnose:

# Check when the last detection was:
DB_PASS=$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2-)
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT MAX(system_time) as last_detection FROM sigma_alerts;"

# Check if rotation is running:
systemctl status sigma-log-rotate.timer
journalctl -u sigma-log-rotate -n 20 --no-pager

# Check if log files are growing:
ls -la /var/log/logstash/processed_syslog.xml

Common causes:

| Cause | Fix |
| --- | --- |
| Logstash not receiving logs | Check that the source (LogRhythm/syslog) is sending to the correct IP:port |
| Log rotation not triggering | systemctl restart sigma-log-rotate.timer |
| RhythmX rules engine failing | Check /var/log/logstash/rotation_errors.log |
| Processed file empty | Source not sending; check network/firewall |
| Detection output dir full | Clean /var/log/logstash/detected_rhythmx/ |

No New Threat Cases

Symptoms: Threat cases section empty, no correlated attack patterns

Diagnose:

systemctl status sigma-case-correlation
journalctl -u sigma-case-correlation -n 30 --no-pager

# Check if sigma_alerts exist (cases are built from these):
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT COUNT(*) FROM sigma_alerts WHERE system_time >= DATE_SUB(NOW(), INTERVAL 24 HOUR);"

Common causes:

| Cause | Fix |
| --- | --- |
| No sigma_alerts | Fix the detection pipeline first (see above) |
| Correlation service not running | systemctl restart sigma-case-correlation |
| All alerts already in cases | Check the case_alerts table; alerts can't be reused |
| Thresholds too high | Check the use case YAML files in /opt/Sigma_ML_MSSP/cases/config/use_cases/ |

Incident Issues

Incidents Not Being Created

Symptoms: Qualifying actors in alarms page but no incidents in /incidents

Diagnose:

# Is the timer running?
systemctl status sigma-incident-sync.timer
systemctl list-timers | grep incident

# Check last run:
journalctl -u sigma-incident-sync -n 30 --no-pager

# Check actor_risk_summary for qualifying actors:
mysql -u root -p"$DB_PASS" sigma_db -e "
  SELECT actor_key, risk_score, risk_level, threat_case_count, last_seen
  FROM actor_risk_summary
  WHERE risk_score >= 70 OR threat_case_count >= 1
  ORDER BY risk_score DESC LIMIT 10;"

Common causes:

| Cause | Fix |
| --- | --- |
| Timer not running | systemctl start sigma-incident-sync.timer |
| actor_risk_summary empty | systemctl restart actor_cache_sync and wait 5 minutes |
| Risk score below threshold | Check INCIDENT_MIN_RISK_SCORE in .env.production (default: 70) |
| last_seen too old (Path B) | Activity must be within INCIDENT_RISK_FRESHNESS_HOURS (default: 24h) |
| DB connection error | Check that MySQL is running and the password is correct |
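The qualification logic above amounts to a two-path predicate. A sketch in Python — the constants mirror the .env variables named in the table, but the exact evaluation inside the sync job (and which path is labeled A) is inferred, not confirmed:

```python
from datetime import datetime, timedelta

# Assumed defaults, mirroring INCIDENT_MIN_RISK_SCORE and
# INCIDENT_RISK_FRESHNESS_HOURS in .env.production.
MIN_RISK_SCORE = 70
RISK_FRESHNESS_HOURS = 24

def qualifies_for_incident(risk_score, threat_case_count, last_seen, now=None):
    """Sketch of the two qualification paths described above.

    Path A (inferred): the actor is attached to at least one threat case.
    Path B: the risk score meets the threshold AND activity is fresh.
    """
    now = now or datetime.now()
    if threat_case_count >= 1:
        return True  # Path A: any open threat case qualifies
    fresh = (now - last_seen) <= timedelta(hours=RISK_FRESHNESS_HOURS)
    return risk_score >= MIN_RISK_SCORE and fresh  # Path B
```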

Duplicate Incidents

Symptoms: Same actor has multiple OPEN incidents

Diagnose:

mysql -u root -p"$DB_PASS" sigma_db -e "
  SELECT group_key, COUNT(*) as incident_count
  FROM incidents WHERE status = 'OPEN'
  GROUP BY group_key HAVING COUNT(*) > 1;"

Fix: This was a bug in the old timestamp-based ID system. The new system uses hash(actor_key + actor_type + entity + sequence). To clean up:

# Keep the most recent incident per actor, delete duplicates:
mysql -u root -p"$DB_PASS" sigma_db -e "
  DELETE i FROM incidents i
  INNER JOIN (
    SELECT group_key, MAX(id) as keep_id
    FROM incidents WHERE status = 'OPEN'
    GROUP BY group_key
  ) keep ON i.group_key = keep.group_key
  WHERE i.id != keep.keep_id AND i.status = 'OPEN';"
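The hash-based scheme mentioned above can be illustrated as follows. This is a sketch only: the field order and separator come from the hash(actor_key + actor_type + entity + sequence) description, but the actual hash function and ID length used by the sync job are assumptions:

```python
import hashlib

def incident_group_id(actor_key, actor_type, entity, sequence=0):
    """Deterministic incident ID: the same actor/entity tuple always maps
    to the same ID, so re-running the sync cannot mint a second OPEN
    incident the way timestamp-based IDs could."""
    raw = f"{actor_key}|{actor_type}|{entity}|{sequence}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
```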

Auto-Close Not Working

Symptoms: Old incidents staying OPEN even with no activity

Diagnose:

# Check if incidents should have been closed:
mysql -u root -p"$DB_PASS" sigma_db -e "
  SELECT incident_id, group_key, status, last_alarm_time,
    TIMESTAMPDIFF(HOUR, last_alarm_time, NOW()) as hours_since_last
  FROM incidents
  WHERE status IN ('OPEN', 'IN_PROGRESS')
  AND last_alarm_time < DATE_SUB(NOW(), INTERVAL 24 HOUR);"

# Check if actor still has open threat cases (blocks auto-close):
mysql -u root -p"$DB_PASS" sigma_db -e "
  SELECT i.group_key, COALESCE(a.threat_case_count, 0) as cases
  FROM incidents i
  LEFT JOIN actor_risk_summary a ON i.group_key = a.actor_key
  WHERE i.status = 'OPEN';"

Common causes:

| Cause | Fix |
| --- | --- |
| Timer not running | systemctl start sigma-incident-sync.timer |
| Actor has open threat cases | Auto-close is blocked (by design); close the cases first |
| INCIDENT_AUTO_CLOSE_HOURS too high | Check .env.production (default: 24) |
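The auto-close rules above reduce to a two-condition check (a sketch; the constant mirrors INCIDENT_AUTO_CLOSE_HOURS and the open-case block is the by-design behavior noted in the table):

```python
from datetime import datetime, timedelta

AUTO_CLOSE_HOURS = 24  # assumed default, mirrors INCIDENT_AUTO_CLOSE_HOURS

def should_auto_close(last_alarm_time, open_case_count, now=None):
    """An incident auto-closes only when it has been quiet for the full
    window AND the actor has no open threat cases (open cases block
    auto-close by design)."""
    now = now or datetime.now()
    quiet = (now - last_alarm_time) >= timedelta(hours=AUTO_CLOSE_HOURS)
    return quiet and open_case_count == 0
```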

Authentication Issues

Can't Login / JWT Expired

Symptoms: Login page shows error, or page keeps showing "Session expired"

Diagnose:

# Test login API directly:
curl -sk -X POST https://localhost/api/login \
  -H "Content-Type: application/json" \
  -d '{"username":"rhythmx","password":"YOUR_PASSWORD"}'

# Check JWT config:
grep JWT_SECRET /opt/Sigma_ML_MSSP/.env.production | head -1

Common causes:

| Cause | Fix |
| --- | --- |
| Wrong password | Check DEFAULT_ADMIN_PASSWORD in .env.production |
| JWT secret changed | All existing tokens are invalidated; users must log in again |
| API Gateway down | systemctl restart api_gateway |
| Browser cache | Hard refresh (Ctrl+Shift+R) or clear cookies |
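To tell an expired token apart from an invalidated one, you can decode the token's payload without verifying the signature (diagnostics only). A stdlib-only sketch; exp is the standard JWT expiry claim:

```python
import base64
import json
import time

def jwt_expiry(token):
    """Decode a JWT payload WITHOUT verifying its signature (never use
    this for auth decisions) and return seconds until expiry; a negative
    value means the token is already expired."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] - int(time.time())
```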

LDAP Sync Failing

Symptoms: AD users not showing in RhythmX, LDAP users can't login

Diagnose:

systemctl status ldapy_sync.timer
journalctl -u ldapy_sync -n 30 --no-pager

# Test LDAP connection manually:
python3 -c "
from ldap3 import Server, Connection
s = Server('YOUR_LDAP_SERVER', port=389)
c = Connection(s, 'DOMAIN\\\\user', 'password')
print('Connected:', c.bind())
"

Common causes:

| Cause | Fix |
| --- | --- |
| Wrong LDAP credentials | Check LDAP_USER and LDAP_PASSWORD in .env.production |
| LDAP server unreachable | Check network/firewall: telnet LDAP_SERVER 389 |
| Invalid Base DN | Verify the LDAP_BASE_DN format (e.g., DC=company,DC=local) |
| Timer not running | systemctl start ldapy_sync.timer |

External API Key Not Working

Symptoms: External feed API returns 401

Diagnose:

# Test the key:
curl -sk -w "%{http_code}" https://localhost/api/v1/feed/health \
  -H "X-API-Key: YOUR_KEY"

# Check if key exists and is enabled:
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT key_id, enabled, entity_name, expires_at FROM api_keys;"

Common causes:

| Cause | Fix |
| --- | --- |
| Key disabled | UPDATE api_keys SET enabled=1 WHERE key_id='xxx'; |
| Key expired | UPDATE api_keys SET expires_at=NULL WHERE key_id='xxx'; |
| Wrong key format | The key must start with the smk_ prefix |
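The three failure modes above reduce to a simple validity predicate. A sketch — the smk_ prefix and field names mirror the table and the api_keys columns queried above; the backend's actual check order is an assumption:

```python
from datetime import datetime

def api_key_is_valid(key, enabled, expires_at, now=None):
    """Mirror of the checks behind a 401: correct prefix, enabled flag
    set, and not past expires_at (None = never expires)."""
    now = now or datetime.now()
    if not key.startswith("smk_"):
        return False  # wrong key format
    if not enabled:
        return False  # key disabled
    if expires_at is not None and expires_at <= now:
        return False  # key expired
    return True
```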

Integration Issues

Jira/ServiceNow Push Failing

Symptoms: Incidents created but no tickets appearing in Jira/SNOW

Diagnose:

# Check retry queue:
mysql -u root -p"$DB_PASS" sigma_db -e "
  SELECT delivery_type, status, attempt_count, last_error, created_at
  FROM delivery_retry_queue
  ORDER BY created_at DESC LIMIT 10;"

# Check retry processor:
systemctl status sigma-retry-processor.timer
journalctl -u sigma-retry-processor -n 20 --no-pager

Common causes:

| Cause | Fix |
| --- | --- |
| Wrong Jira/SNOW credentials | Check the integration config in the System Settings UI |
| Network/proxy issue | Test curl https://your-jira-instance.com from the server |
| Auto-push not enabled | Check AUTO_PUSH_MIN_SEVERITY in .env.production |
| Retry timer stopped | systemctl start sigma-retry-processor.timer |

Syslog Sender Not Forwarding

Symptoms: External SIEM not receiving alerts from RhythmX

Diagnose:

systemctl status sigma-syslog-sender
curl -s http://localhost:8888/health | python3 -m json.tool

# Check config:
cat /opt/Sigma_ML_MSSP/siem_sender/config.json

Common causes:

| Cause | Fix |
| --- | --- |
| Wrong target IP/port | Fix in config.json or the System Settings UI |
| Target not reachable | telnet TARGET_IP 514 from the server |
| Service crashed | systemctl restart sigma-syslog-sender |
| No new alerts to send | Check sigma_alerts; the pipeline may be stalled |
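To verify the network path independently of the service, you can hand-craft a single syslog datagram. A sketch assuming UDP syslog on port 514 (swap in your SIEM's address); UDP gives no delivery guarantee, so confirm receipt on the SIEM side:

```python
import socket

def send_test_syslog(host, port=514):
    """Send one RFC 3164-style test message over UDP and return the
    number of bytes written to the socket."""
    msg = b"<14>RhythmX syslog path test"  # PRI 14 = facility user, severity info
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(msg, (host, port))
```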

Performance Issues

High CPU / RAM

Diagnose:

# Top processes:
top -bn1 | head -15

# Per-service memory:
systemctl show api_gateway -p MemoryCurrent
ps aux --sort=-%mem | head -10

# Check for runaway processes:
ps aux | grep -E "python3|java|gunicorn" | grep -v grep

Common causes:

| Cause | Fix |
| --- | --- |
| Logstash heap too large | Reduce the JVM heap in /etc/logstash/jvm.options |
| MySQL InnoDB buffer | Reduce innodb_buffer_pool_size in /etc/my.cnf |
| Too many Gunicorn workers | Reduce the worker count in api_gateway.service |

Slow API Responses

Diagnose:

# Test response times:
curl -sk -o /dev/null -w "Time: %{time_total}s\n" "https://localhost/api/alarms/incidents?timeframe=1d&group_by=actor"

# Check MySQL slow queries:
mysql -u root -p"$DB_PASS" -e "SHOW PROCESSLIST;" | grep -v Sleep

Common causes:

| Cause | Fix |
| --- | --- |
| Missing database indexes | Run python3 /opt/Sigma_ML_MSSP/Backend/initializer.py |
| Large sigma_alerts table | Check retention; consider archiving old data |
| Too many concurrent users | Increase the number of Gunicorn workers |

Disk Full

Diagnose:

df -h
du -sh /var/lib/mysql /var/log/logstash /var/lib/logstash /opt/Sigma_ML_MSSP 2>/dev/null | sort -h

Common causes:

| Location | What fills it | Fix |
| --- | --- | --- |
| /var/lib/mysql | sigma_alerts growing | Archive old data; increase retention cleanup |
| /var/log/logstash | Processed log files | Check that log rotation is running |
| /var/lib/logstash/queue | Logstash persistent queue | Pipeline backlog; check the output targets |

Quick Recovery Commands

# Restart everything (nuclear option):
systemctl restart mysqld
sleep 3
systemctl restart api_gateway logstash sigma_sql_backend risk_scoring \
  isolation_forest sigma-case-correlation actor_cache_sync \
  logrhythm-sync sigma-syslog-sender
systemctl restart nginx

# Check everything is up:
systemctl status api_gateway nginx mysqld logstash sigma_sql_backend \
  sigma-case-correlation actor_cache_sync --no-pager | grep -E "Active:|●"

Log Locations

| Service | How to view logs |
| --- | --- |
| API Gateway | journalctl -u api_gateway -f |
| Nginx | /var/log/nginx/error.log and /var/log/nginx/access.log |
| MySQL | journalctl -u mysqld -f |
| Logstash | journalctl -u logstash -f |
| Detection pipeline | /var/log/logstash/rotation_errors.log |
| Correlation engine | journalctl -u sigma-case-correlation -f |
| Actor cache sync | journalctl -u actor_cache_sync -f |
| Incident sync | journalctl -u sigma-incident-sync -f |
| Syslog sender | journalctl -u sigma-syslog-sender -f |
| LDAP sync | journalctl -u ldapy_sync -f |