# Troubleshooting Guide

Quick reference for diagnosing and fixing common issues in RhythmX production environments.

## Service Health Check (First Step)

Run this to see the status of all RhythmX services:

```shell
systemctl status api_gateway nginx mysqld logstash sigma_sql_backend \
  risk_scoring isolation_forest sigma-case-correlation actor_cache_sync \
  logrhythm-sync sigma-syslog-sender --no-pager
```

Check all timers:

```shell
systemctl list-timers | grep -i sigma
```
## Service Won't Start / Crashed

### API Gateway (api_gateway.service)

Symptoms: UI not loading, all API calls fail, 502 Bad Gateway

Diagnose:

```shell
systemctl status api_gateway
journalctl -u api_gateway -n 50 --no-pager
```
Common causes and fixes:

| Cause | Log message | Fix |
|---|---|---|
| MySQL down | Can't connect to MySQL | systemctl start mysqld |
| Port in use | Address already in use | lsof -i :5000, then kill the stale process |
| Permission denied on .env | Permission denied: .env.production | setfacl -m u:sigma_api:rw /opt/Sigma_ML_MSSP/.env.production |
| Python package missing | ModuleNotFoundError | pip3 install <missing_package> |
| Bad .env syntax | ValueError on startup | Check .env.production for syntax errors |
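The bad .env syntax cause is the easiest to pre-screen: scan the file for lines that are neither comments, blanks, nor KEY=VALUE pairs. A minimal sketch, demonstrated on a throwaway sample file (the regex assumes simple KEY=VALUE lines, which is how the file is used throughout this guide):

```shell
# Flag lines that are not comments, blank lines, or KEY=VALUE pairs.
check_env() {
  grep -nvE '^\s*(#|$|[A-Za-z_][A-Za-z0-9_]*=)' "$1"
}

# Demo on an inline sample; in practice point it at .env.production:
cat > /tmp/sample.env <<'EOF'
DB_PASSWORD=secret
# a comment
BROKEN LINE
JWT_SECRET=abc
EOF
check_env /tmp/sample.env   # prints 3:BROKEN LINE
```

A non-zero exit with no output means every line parses; any printed line is a candidate for the ValueError seen at startup.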
Restart:

```shell
systemctl restart api_gateway
# Verify:
curl -sk -o /dev/null -w "%{http_code}" https://localhost/
# Should return 200
```
### Nginx

Symptoms: Can't reach the site at all, connection refused on 443

Diagnose:

```shell
systemctl status nginx
nginx -t   # Test config syntax
tail -20 /var/log/nginx/error.log
```
Common causes and fixes:
| Cause | Fix |
|---|---|
| Config syntax error | nginx -t shows the error — fix the config file |
| SSL cert missing/expired | Check /etc/nginx/ssl/ — regenerate self-signed cert |
| Port 443 already in use | lsof -i :443 — kill conflicting process |
| SELinux blocking | ausearch -m avc -ts recent — add policy or set permissive |
Restart:

```shell
nginx -t && systemctl restart nginx
```
### MySQL (mysqld.service)

Symptoms: Everything fails — API, detections, incidents

Diagnose:

```shell
systemctl status mysqld
journalctl -u mysqld -n 50 --no-pager
mysql -u root -p"$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2)" -e "SELECT 1;"
```
Common causes and fixes:

| Cause | Log message | Fix |
|---|---|---|
| Disk full | No space left on device | Free disk space, check df -h |
| InnoDB corruption | InnoDB: corruption | Restore from backup |
| Too many connections | Too many connections | SET GLOBAL max_connections = 500; |
| Wrong password | Access denied | Check DB_PASSWORD in .env.production |
| Socket file missing | Can't connect through socket | systemctl restart mysqld |
Check database health (note: ROWS is a reserved word in MySQL 8, so the alias is row_count):

```shell
DB_PASS=$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2)
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT 'sigma_alerts' AS tbl, COUNT(*) AS row_count FROM sigma_alerts
UNION SELECT 'threat_cases', COUNT(*) FROM threat_cases
UNION SELECT 'incidents', COUNT(*) FROM incidents
UNION SELECT 'alarm_actors', COUNT(*) FROM alarm_actors;"
```
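One caveat on the password extraction used throughout this guide: cut -d= -f2 silently truncates a value that itself contains an = sign. A safer variant, sketched on a throwaway file:

```shell
# cut -d= -f2 keeps only the text between the first and second '='.
# sed strips the key prefix and keeps the whole value instead.
cat > /tmp/demo.env <<'EOF'
DB_PASSWORD=p@ss=word
EOF

DB_PASS=$(sed -n 's/^DB_PASSWORD=//p' /tmp/demo.env)
echo "$DB_PASS"   # p@ss=word  (cut -d= -f2 would return only "p@ss")
```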
### Logstash (logstash.service)

Symptoms: No new logs being ingested, ports 5514/5515/5516 not responding

Diagnose:

```shell
systemctl status logstash
journalctl -u logstash -n 50 --no-pager
# Check if ports are listening:
ss -tlnp | grep -E "5514|5515|5516"
```
Common causes and fixes:
| Cause | Fix |
|---|---|
| Pipeline config error | Check /etc/logstash/conf.d/*.conf for syntax |
| JVM out of memory | Increase heap in /etc/logstash/jvm.options |
| Port conflict | Another process on 5514/5515/5516 — kill it |
| Persisted queue corrupt | Remove /var/lib/logstash/queue/ and restart |
Check pipeline stats:

```shell
curl -s http://localhost:9600/_node/stats/pipelines | python3 -m json.tool | head -30
```
## Detection Pipeline Issues

### No New sigma_alerts Appearing

Symptoms: Investigation page shows stale data, no new detections

Diagnose:

```shell
# Check when the last detection was:
DB_PASS=$(grep DB_PASSWORD /opt/Sigma_ML_MSSP/.env.production | cut -d= -f2)
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT MAX(system_time) AS last_detection FROM sigma_alerts;"
# Check if rotation is running:
systemctl status sigma-log-rotate.timer
journalctl -u sigma-log-rotate -n 20 --no-pager
# Check if log files are growing:
ls -la /var/log/logstash/processed_syslog.xml
```
Common causes:
| Cause | Fix |
|---|---|
| Logstash not receiving logs | Check source (LogRhythm/syslog) is sending to correct IP:port |
| Log rotation not triggering | systemctl restart sigma-log-rotate.timer |
| RhythmX Rules engine failing | Check /var/log/logstash/rotation_errors.log |
| Processed file empty | Source not sending — check network/firewall |
| Detection output dir full | Clean /var/log/logstash/detected_rhythmx/ |
### No New Threat Cases

Symptoms: Threat cases section empty, no correlated attack patterns

Diagnose:

```shell
systemctl status sigma-case-correlation
journalctl -u sigma-case-correlation -n 30 --no-pager
# Check if sigma_alerts exist (cases are built from these):
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT COUNT(*) FROM sigma_alerts WHERE system_time >= DATE_SUB(NOW(), INTERVAL 24 HOUR);"
```
Common causes:
| Cause | Fix |
|---|---|
| No sigma_alerts | Fix detection pipeline first (see above) |
| Correlation service not running | systemctl restart sigma-case-correlation |
| All alerts already in cases | Check case_alerts table — alerts can't be reused |
| Thresholds too high | Check use case YAML files in /opt/Sigma_ML_MSSP/cases/config/use_cases/ |
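If thresholds are the suspect, temporarily lowering one confirms the diagnosis faster than log-reading. The actual schema of the use-case YAML files is deployment-specific; the fragment below is purely hypothetical (every field name is invented for illustration, not taken from the product) and only shows the kind of knob to look for:

```yaml
# Hypothetical use-case config — field names are illustrative only.
use_case: brute_force_login
min_alert_count: 5      # lower this to test whether cases start forming
window_minutes: 60
```

Compare against the real keys in /opt/Sigma_ML_MSSP/cases/config/use_cases/ before changing anything, and restart sigma-case-correlation afterwards.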
## Incident Issues

### Incidents Not Being Created

Symptoms: Qualifying actors on the alarms page but no incidents in /incidents

Diagnose:

```shell
# Is the timer running?
systemctl status sigma-incident-sync.timer
systemctl list-timers | grep incident
# Check last run:
journalctl -u sigma-incident-sync -n 30 --no-pager
# Check actor_risk_summary for qualifying actors:
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT actor_key, risk_score, risk_level, threat_case_count, last_seen
FROM actor_risk_summary
WHERE risk_score >= 70 OR threat_case_count >= 1
ORDER BY risk_score DESC LIMIT 10;"
```
Common causes:
| Cause | Fix |
|---|---|
| Timer not running | systemctl start sigma-incident-sync.timer |
| actor_risk_summary empty | systemctl restart actor_cache_sync and wait 5 minutes |
| Risk score below threshold | Check INCIDENT_MIN_RISK_SCORE in .env.production (default: 70) |
| last_seen too old (Path B) | Activity must be within INCIDENT_RISK_FRESHNESS_HOURS (default: 24h) |
| DB connection error | Check MySQL is running and password is correct |
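The Path B freshness rule is just an hours-since-last_seen comparison, which you can reproduce in the shell when eyeballing a single actor. A minimal sketch using GNU date; the timestamps below are sample values, in practice substitute last_seen from the query above:

```shell
# Hours elapsed between a last_seen timestamp and a reference time
# (fixed here so the demo is reproducible).
hours_since() {
  local last="$1" now="$2"
  echo $(( ( $(date -d "$now" +%s) - $(date -d "$last" +%s) ) / 3600 ))
}

h=$(hours_since "2024-06-01 00:00:00" "2024-06-02 06:00:00")
echo "$h"   # 30 — older than the 24h default, so Path B would not fire
```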
### Duplicate Incidents

Symptoms: Same actor has multiple OPEN incidents

Diagnose:

```shell
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT group_key, COUNT(*) AS incident_count
FROM incidents WHERE status = 'OPEN'
GROUP BY group_key HAVING COUNT(*) > 1;"
```
Fix: This was a bug in the old timestamp-based ID system. The new system uses hash(actor_key + actor_type + entity + sequence). To clean up:

```shell
# Keep the most recent incident per actor, delete duplicates:
mysql -u root -p"$DB_PASS" sigma_db -e "
DELETE i FROM incidents i
INNER JOIN (
  SELECT group_key, MAX(id) AS keep_id
  FROM incidents WHERE status = 'OPEN'
  GROUP BY group_key
) keep ON i.group_key = keep.group_key
WHERE i.id != keep.keep_id AND i.status = 'OPEN';"
```
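Before running the DELETE, it can help to rehearse the keep-the-highest-id-per-group rule outside the database. The same logic on plain text, with sample group_key,id pairs standing in for OPEN incidents:

```shell
# Sample rows: group_key,id — keep only the highest id per group_key.
printf '%s\n' 'actorA,3' 'actorA,7' 'actorB,2' |
  sort -t, -k1,1 -k2,2nr |     # sort by group, then id descending
  awk -F, '!seen[$1]++'        # first row per group = the keeper
# Output:
# actorA,7
# actorB,2
```

Everything not printed by the keeper filter corresponds to a row the DELETE would remove.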
### Auto-Close Not Working

Symptoms: Old incidents staying OPEN even with no activity

Diagnose:

```shell
# Check if incidents should have been closed:
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT incident_id, group_key, status, last_alarm_time,
       TIMESTAMPDIFF(HOUR, last_alarm_time, NOW()) AS hours_since_last
FROM incidents
WHERE status IN ('OPEN', 'IN_PROGRESS')
  AND last_alarm_time < DATE_SUB(NOW(), INTERVAL 24 HOUR);"
# Check if the actor still has open threat cases (blocks auto-close):
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT i.group_key, COALESCE(a.threat_case_count, 0) AS cases
FROM incidents i
LEFT JOIN actor_risk_summary a ON i.group_key = a.actor_key
WHERE i.status = 'OPEN';"
```
Common causes:

| Cause | Fix |
|---|---|
| Timer not running | systemctl start sigma-incident-sync.timer |
| Actor has open threat cases | Auto-close is blocked (by design) — close cases first |
| INCIDENT_AUTO_CLOSE_HOURS too high | Check .env.production (default: 24) |
## Authentication Issues

### Can't Log In / JWT Expired

Symptoms: Login page shows an error, or the page keeps showing "Session expired"

Diagnose:

```shell
# Test login API directly:
curl -sk -X POST https://localhost/api/login \
  -H "Content-Type: application/json" \
  -d '{"username":"rhythmx","password":"YOUR_PASSWORD"}'
# Check JWT config:
grep JWT_SECRET /opt/Sigma_ML_MSSP/.env.production | head -1
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong password | Check DEFAULT_ADMIN_PASSWORD in .env.production |
| JWT secret changed | All existing tokens invalidated — users must re-login |
| API Gateway down | systemctl restart api_gateway |
| Browser cache | Hard refresh Ctrl+Shift+R or clear cookies |
### LDAP Sync Failing

Symptoms: AD users not showing in RhythmX, LDAP users can't log in

Diagnose:

```shell
systemctl status ldapy_sync.timer
journalctl -u ldapy_sync -n 30 --no-pager
# Test LDAP connection manually:
python3 -c "
from ldap3 import Server, Connection
s = Server('YOUR_LDAP_SERVER', port=389)
c = Connection(s, 'DOMAIN\\\\user', 'password')
print('Connected:', c.bind())
"
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong LDAP credentials | Check LDAP_USER and LDAP_PASSWORD in .env.production |
| LDAP server unreachable | Check network/firewall — telnet LDAP_SERVER 389 |
| Invalid Base DN | Verify LDAP_BASE_DN format (e.g., DC=company,DC=local) |
| Timer not running | systemctl start ldapy_sync.timer |
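The Base DN is worth a syntactic pre-check, since a missing comma is the usual typo. This sketch only validates the DC=...,DC=... shape shown in the table's example, not DNs containing OU or CN components:

```shell
# Accepts DC=company,DC=local style values; rejects obvious typos.
valid_base_dn() {
  echo "$1" | grep -qE '^DC=[^,=]+(,DC=[^,=]+)+$'
}

valid_base_dn 'DC=company,DC=local' && echo ok        # ok
valid_base_dn 'DC=company DC=local' || echo malformed # malformed
```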
### External API Key Not Working

Symptoms: External feed API returns 401

Diagnose:

```shell
# Test the key:
curl -sk -w "%{http_code}" https://localhost/api/v1/feed/health \
  -H "X-API-Key: YOUR_KEY"
# Check if the key exists and is enabled:
mysql -u root -p"$DB_PASS" sigma_db -e "SELECT key_id, enabled, entity_name, expires_at FROM api_keys;"
```
Common causes:
| Cause | Fix |
|---|---|
| Key disabled | UPDATE api_keys SET enabled=1 WHERE key_id='xxx'; |
| Key expired | UPDATE api_keys SET expires_at=NULL WHERE key_id='xxx'; |
| Wrong key format | Must start with smk_ prefix |
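The smk_ prefix rule can be pre-checked locally before blaming the server. Only the prefix comes from this guide; requiring at least one character after it is an illustrative extra:

```shell
# Reject keys that don't carry the expected smk_ prefix.
looks_like_api_key() {
  case "$1" in
    smk_?*) return 0 ;;   # prefix plus at least one more character
    *)      return 1 ;;
  esac
}

looks_like_api_key 'smk_abc123' && echo ok        # ok
looks_like_api_key 'abc123' || echo 'bad prefix'  # bad prefix
```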
## Integration Issues

### Jira/ServiceNow Push Failing

Symptoms: Incidents created but no tickets appearing in Jira/SNOW

Diagnose:

```shell
# Check the retry queue:
mysql -u root -p"$DB_PASS" sigma_db -e "
SELECT delivery_type, status, attempt_count, last_error, created_at
FROM delivery_retry_queue
ORDER BY created_at DESC LIMIT 10;"
# Check the retry processor:
systemctl status sigma-retry-processor.timer
journalctl -u sigma-retry-processor -n 20 --no-pager
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong Jira/SNOW credentials | Check integration config in System Settings UI |
| Network/proxy issue | Test curl https://your-jira-instance.com from the server |
| Auto-push not enabled | Check AUTO_PUSH_MIN_SEVERITY in .env.production |
| Retry timer stopped | systemctl start sigma-retry-processor.timer |
### Syslog Sender Not Forwarding

Symptoms: External SIEM not receiving alerts from RhythmX

Diagnose:

```shell
systemctl status sigma-syslog-sender
curl -s http://localhost:8888/health | python3 -m json.tool
# Check config:
cat /opt/Sigma_ML_MSSP/siem_sender/config.json
```
Common causes:
| Cause | Fix |
|---|---|
| Wrong target IP/port | Fix in config.json or System Settings UI |
| Target not reachable | telnet TARGET_IP 514 from the server |
| Service crashed | systemctl restart sigma-syslog-sender |
| No new alerts to send | Check sigma_alerts — pipeline may be stalled |
## Performance Issues

### High CPU / RAM

Diagnose:

```shell
# Top processes:
top -bn1 | head -15
# Per-service memory:
systemctl show api_gateway -p MemoryCurrent
ps aux --sort=-%mem | head -10
# Check for runaway processes:
ps aux | grep -E "python3|java|gunicorn" | grep -v grep
```
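When several workers of the same daemon each look small individually, aggregating memory per command name tells the real story. The awk aggregation below is demonstrated on captured sample lines; it assumes the usual Linux ps aux column layout (%MEM in field 4, command in field 11):

```shell
# Sum %MEM (field 4) per command (field 11) from ps aux-style output.
printf '%s\n' \
  'api  101 0.0 2.5 0 0 ? S 0:00 x gunicorn' \
  'api  102 0.0 3.0 0 0 ? S 0:00 x gunicorn' \
  'ls   200 0.0 9.1 0 0 ? S 0:00 x java' |
  awk '{mem[$11] += $4} END {for (c in mem) printf "%s %.1f\n", c, mem[c]}' |
  sort
# Output:
# gunicorn 5.5
# java 9.1
```

On a live box, replace the printf sample with ps aux | tail -n +2.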
Common causes:
| Cause | Fix |
|---|---|
| Logstash heap too large | Reduce JVM heap in /etc/logstash/jvm.options |
| MySQL InnoDB buffer | Reduce innodb_buffer_pool_size in /etc/my.cnf |
| Too many Gunicorn workers | Reduce workers in api_gateway.service |
### Slow API Responses

Diagnose:

```shell
# Test response times (quote the URL so the shell doesn't treat & as a job separator):
curl -sk -o /dev/null -w "Time: %{time_total}s\n" "https://localhost/api/alarms/incidents?timeframe=1d&group_by=actor"
# Check MySQL slow queries:
mysql -u root -p"$DB_PASS" -e "SHOW PROCESSLIST;" | grep -v Sleep
```
Common causes:
| Cause | Fix |
|---|---|
| Missing database indexes | Run python3 /opt/Sigma_ML_MSSP/Backend/initializer.py |
| Large sigma_alerts table | Check retention — consider archiving old data |
| Too many concurrent users | Increase Gunicorn workers |
### Disk Full

Diagnose:

```shell
df -h
du -sh /var/lib/mysql /var/log/logstash /var/lib/logstash /opt/Sigma_ML_MSSP 2>/dev/null | sort -h
```
Common causes:

| Location | What fills it | Fix |
|---|---|---|
| /var/lib/mysql | sigma_alerts growing | Archive old data, increase retention cleanup |
| /var/log/logstash | Processed log files | Check log rotation is running |
| /var/lib/logstash/queue | Logstash persistent queue | Pipeline backlog — check output targets |
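The df check can be scripted so near-full filesystems stand out on their own. The awk filter is shown on captured sample lines; on a live box you could feed it df -h --output=pcent,target (GNU coreutils):

```shell
# Print filesystems at or above 90% use from df-style "pcent target" lines.
printf '%s\n' ' 42% /' ' 93% /var/lib/mysql' ' 97% /var/log' |
  awk '{ p = $1; sub(/%/, "", p); if (p + 0 >= 90) print $2, "at", $1 }'
# Output:
# /var/lib/mysql at 93%
# /var/log at 97%
```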
## Quick Recovery Commands

```shell
# Restart everything (nuclear option):
systemctl restart mysqld
sleep 3
systemctl restart api_gateway logstash sigma_sql_backend risk_scoring \
  isolation_forest sigma-case-correlation actor_cache_sync \
  logrhythm-sync sigma-syslog-sender
systemctl restart nginx
# Check everything is up:
systemctl status api_gateway nginx mysqld logstash sigma_sql_backend \
  sigma-case-correlation actor_cache_sync --no-pager | grep -E "Active:|●"
```
## Log Locations
| Service | How to view logs |
|---|---|
| API Gateway | journalctl -u api_gateway -f |
| Nginx | /var/log/nginx/error.log and /var/log/nginx/access.log |
| MySQL | journalctl -u mysqld -f |
| Logstash | journalctl -u logstash -f |
| Detection pipeline | /var/log/logstash/rotation_errors.log |
| Correlation engine | journalctl -u sigma-case-correlation -f |
| Actor cache sync | journalctl -u actor_cache_sync -f |
| Incident sync | journalctl -u sigma-incident-sync -f |
| Syslog sender | journalctl -u sigma-syslog-sender -f |
| LDAP sync | journalctl -u ldapy_sync -f |