Operating system (OS) kernel panics, which occur when the OS encounters unrecoverable fatal errors, pose significant challenges to the stability and reliability of ByteDance’s large-scale cloud services. While analyzing kernel panic logs is crucial for diagnosing root causes and preventing future occurrences, kernel panic root cause analysis (RCA) faces two major challenges: (1) the scarcity of logs that directly indicate kernel panics, and (2) the presence of complex long-range dependencies within kernel panic logs that complicate root cause identification. To address these challenges, we propose LogSage, a novel log-based framework for kernel panic RCA in large-scale cloud services. LogSage employs a hybrid approach that combines clustering techniques and large language models (LLMs) to effectively extract fault-indicating logs, while leveraging Graph Neural Networks (GNNs) to capture and model the intricate relationships within these logs. We evaluated LogSage using a comprehensive dataset of 20,000 kernel panics from ByteDance’s production environment. The results demonstrate that LogSage significantly outperforms existing methods in both accuracy and robustness of root cause identification. Moreover, a six-month deployment of LogSage in ByteDance’s cloud infrastructure has proven its practical value, successfully assisting engineers in real-world kernel panic RCA tasks, as validated through multiple case studies.
graph TD A[Log Input] --> B[Preprocessing] B --> C[Graph Construction] C --> D[GNN Model Training] D --> E[Model Serving] E --> F[API Interface] F --> G[Frontend Display] C --> H[Graph Storage] D --> H
cd LogSage
python -m venv venv source venv/bin/activate # Linux/Mac venv\Scripts\activate # Windows
pip install -r requirements.txt
python scripts/preprocess.py --input data/raw/ --output data/processed/
python scripts/train_model.py --model graphsage --epochs 50 --gpu
python scripts/train_model.py --model gat --epochs 50 --gpu
python scripts/train_model.py --model gcn --epochs 50 --gpu
python src/serving/api_server.py
gunicorn -w 4 -k uvicorn.workers.UvicornWorker src.serving.api_server:app
LogSage/ ├── configs/ # Configuration files │ ├── model_config.yaml │ └── server_config.yaml ├── data/ # Data directory │ ├── raw/ # Raw logs │ ├── processed/ # Processed data │ └── streaming/ # Streaming log input ├── docs/ # Documentation ├── models/ # Trained models ├── scripts/ # Utility scripts │ ├── preprocess.py │ ├── train_model.py │ └── start_server.sh ├── src/ # Source code │ ├── KPRCA_stage/ # Core algorithms │ │ ├── models/ # Model definitions │ │ ├── trainer.py # Training logic │ │ └── graph_constructor.py │ └── serving/ # Server-side code │ ├── api_server.py │ └── graph_updater.py ├── tests/ # Unit tests ├── requirements.txt # Python dependencies └── README.md # This document
- Create a new model file in src/KPRCA_stage/models/
- Implement the base model interface
- Update trainer.py to support the new model
- Add test cases
- Fork the repository
- Create your feature branch (git checkout -b feature AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Apache License 2.0
If you use LogSage in your research, please cite
For technical inquiries: Email: [cuitianyu@mail.nankai.edu.cn]