[BUG]: Panic after power fail #1883
Comments
also related to #1797 |
It's not crashing every 5min. We take a complete backup every 5min and keep the backups in min/hour/day folders, so after a crash we can recover. Sometimes the DB is just corrupted and we can't detect that, so the 5min backups start being overwritten with incomplete data. We are thinking about always backing up during shutdown. I don't know what to do about the silent loss of keys. |
If you could share your last backup and the data immediately after the crash (if it is not confidential), along with logs, that would give us a good amount of information to debug this. As I said, we currently do not run our tests on Windows, and I am not sure if this will change in the future. mmap could be behaving differently on Windows, leading to corruption of the DB. I do see quite a few open issues on Windows related to this; maybe that pushes us to at least ensure our tests pass on Windows. Thanks for the information in any case. |
I appreciate your interest. I'll try to gather more data for you. Confidential data is an issue. |
How do you do this? Since it's ACID compliant, it should work, but only if you take an atomic filesystem snapshot. Copying the database with rsync while it is in use is certainly not the way to go. |
A task reads all keys/values and writes them to a file. |
panic: runtime error: slice bounds out of range [-16:] [recovered] goroutine 58 [running]: |
That's not a clean and consistent approach. If you have to do without shutdown, you have to do it atomically on filesystem level and rely on ACID conformity and recovery. |
It can't be shut down every 5min to do a backup. |
Do an atomic file-system snapshot. They are provided by e.g. zfs and btrfs, or, if you have the money, by storage systems that support them, such as those from NetApp (and others, of course, because this is a standard feature). This should allow you to do hot backups with a consistent state because, as said, it's an atomic operation.
Oh Windows ....... Hmm, this is a problem. |
Badger provides snapshot isolation. That means, if you use a txn to read data during backup, you will get a consistent snapshot of the data. You don't need to worry about anything in that case and your snapshot is a consistent view of the data at a given point in time. |
This is indeed a problem, but the main issue is that the panic cannot be caught through err or recover(), so the program crashes continuously and no follow-up handling is possible.
package main

import (
	"log"

	"github.com/dgraph-io/badger/v4"
)

func main() {
	// This deferred recover() does not catch the panic.
	defer func() {
		if r := recover(); r != nil {
			log.Fatal("catch panic: ", r)
		}
	}()

	// The With* methods return a modified copy of Options, so chain them;
	// calling them without using the result has no effect.
	opt := badger.DefaultOptions("../db").
		WithMemTableSize(1 << 20).
		WithValueThreshold(1 << 10).
		WithValueLogFileSize(1 << 20).
		WithLoggingLevel(badger.DEBUG)

	db, err := badger.Open(opt)
	// Open does not return an error here.
	if err != nil {
		log.Fatal("have err: ", err)
	}
	db.Close()
}
|
Why would that be? You should be able to recover from panic using the recover() function. I don't have a windows machine, but do you have any insights? |
The example that Gemone gave isn't crashing for me. I now use "github.com/gorpher/gowin32/session_notifications" to close the DB before a restart. |
@mangalaman93 This issue is not related to the system, but rather caused by the power outage that resulted in database file corruption. This issue would occur on any system. The inability to recover() may be due to an internal recover() call within Open(), or a panic occurring within a goroutine. Line 353 in f9b9e4d
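To illustrate the goroutine explanation: in Go, a deferred recover() only sees panics raised in its own goroutine. The stack trace in the report ends with "created by github.com/dgraph-io/badger/v3.newLevelsController", so the table is opened in a goroutine started inside the library, and a recover() in the caller's main() cannot reach it. The sketch below uses only the standard library; `runGuarded` is a hypothetical helper, not part of Badger, and it only works because the recover is placed inside the spawned goroutine. A caller of Open() cannot apply it from the outside.

```go
package main

import "fmt"

// runGuarded (hypothetical helper) runs fn in a new goroutine and converts a
// panic inside it into an error. The deferred recover must live in the same
// goroutine as the panic; a recover in the calling goroutine never sees it.
func runGuarded(fn func()) error {
	errCh := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errCh <- fmt.Errorf("recovered panic: %v", r)
				return
			}
			errCh <- nil
		}()
		fn()
	}()
	return <-errCh
}

func main() {
	err := runGuarded(func() {
		var index []byte
		// Slicing an empty buffer at len-16 panics at run time, the same
		// shape of failure as in the report.
		_ = index[len(index)-16:]
	})
	fmt.Println("error:", err)
}
```

Without the deferred function inside the goroutine, the same panic would terminate the whole process regardless of any recover() in main().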
@dgallion1 My program runs as a service that automatically restarts after a crash. However, whenever the DB file is corrupted, Open() panics in a way that recover() cannot catch, which causes the service to keep restarting. Go version: 1.20.3 |
If you could help us reproduce it on linux, I will be happy to look into this on higher priority. |
I have the DB that caused this and will look into it more.
badger 2023/06/22 07:56:21 INFO: All 6 tables opened in 212ms
goroutine 190 [running]: |
This code crashes and isn't caught. Same thing in Badger.
I was able to catch the crash here.
|
Is there any way to resolve this? It is becoming frequent in my case. |
@Akhilesh53 My current approach is to store the cached state in the registry. If Badger does not start properly, I delete the damaged files and restart Badger. For reference only. From the source code, it seems that this error is not passed back to the caller, which makes it difficult for me to handle it properly. |
But why is Badger creating corrupted files that it cannot read itself? I am getting these three errors continuously, listed below.
negative offset :: -1147080149
|
I also encountered this on v3.2103.2 while writing unit tests on macOS that simulate a power loss and corruption on an embedded device. The following app causes a panic. I think this could be handled better by adding bounds checking to https://github.com/dgraph-io/badger/blob/v3.2103.2/table/table.go#L430
|
I often encounter this problem. The database gets damaged when a Windows system is physically powered off or hits a blue screen. Presumably some data had not yet been written to disk. |
Hey guys, I wrote a draft PR on a Badger fork for internal use at the company where I work, and in my manual tests it fixed this issue by allowing the caller of the @mangalaman93 if you think this solution looks promising, I can polish it up, add some tests and create a proper PR for the official Badger repo. Let me know what you think. Oh, and I can't share the database I am using because it contains information private to my current company, but I will try to produce a broken database by repeatedly force-killing a Go process that is reading from and writing to a sample database. I will try to get it done tonight.
|
@mangalaman93 I just created a new PR based on badger V4: |
I could reproduce the panic by replacing an *.sst file with an empty *.sst. It would be great to have this fixed, so we can react when a database is corrupt |
This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open. |
What version of Badger are you using?
v3.2103.2
What version of Go are you using?
go version go1.19.4 windows/amd64
Have you tried reproducing the issue with the latest release?
None
What is the hardware spec (RAM, CPU, OS)?
OS Name Microsoft Windows 11 Pro
Version 10.0.22621 Build 22621
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AMD2
System Manufacturer ASUS
System Model System Product Name
System Type x64-based PC
System SKU SKU
Processor AMD Ryzen 9 7950X 16-Core Processor, 4501 Mhz, 16 Core(s), 32 Logical Processor(s)
BIOS Version/Date American Megatrends Inc. 0705, 10/5/2022
SMBIOS Version 3.5
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer ASUSTeK COMPUTER INC.
BaseBoard Product ROG CROSSHAIR X670E EXTREME
BaseBoard Version Rev 1.xx
Platform Role Desktop
Secure Boot State Off
PCR7 Configuration Elevation Required to View
Windows Directory C:\Windows
System Directory C:\Windows\system32
Boot Device \Device\HarddiskVolume1
Locale United States
Hardware Abstraction Layer Version = "10.0.22621.819"
User Name AMD2\dgall
Time Zone Eastern Standard Time
Installed Physical Memory (RAM) 128 GB
Total Physical Memory 127 GB
Available Physical Memory 108 GB
Total Virtual Memory 145 GB
Available Virtual Memory 119 GB
Page File Space 18.0 GB
Page File C:\pagefile.sys
Kernel DMA Protection Off
Virtualization-based security Running
Virtualization-based security Required Security Properties
Virtualization-based security Available Security Properties Base Virtualization Support, DMA Protection, UEFI Code Readonly, SMM Security Mitigations 1.0, Mode Based Execution Control
Virtualization-based security Services Configured Hypervisor enforced Code Integrity
Virtualization-based security Services Running Credential Guard, Hypervisor enforced Code Integrity
Windows Defender Application Control policy Enforced
Windows Defender Application Control user mode policy Off
Device Encryption Support Elevation Required to View
A hypervisor has been detected. Features required for Hyper-V will not be displayed.
What steps will reproduce the bug?
Power removed.
Expected behavior and actual result.
After a power failure, running Windows 11.
panic: runtime error: slice bounds out of range [-16:] [recovered]
panic:
== Recovering from initIndex crash ==
File Info: [ID: 1157, Size: 8545749, Zeros: 8545749]
isEnrypted: true checksumLen: 0 checksum: indexLen: 0 index: []
== Recovered ==
goroutine 33 [running]:
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1()
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:351 +0xa8
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1()
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:397 +0xc2
panic({0x7ff7462d5cc0, 0xc00059e018})
C:/Program Files/Go/src/runtime/panic.go:884 +0x212
github.com/dgraph-io/badger/v3/table.(*Table).decrypt(0xc00087c030?, {0x22decfa65cd?, 0xc000159970?, 0x7ff745b5e2ce?}, 0xcd?)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:753 +0x16b
github.com/dgraph-io/badger/v3/table.(*Table).readTableIndex(0xc00021ee40)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:702 +0x5c
github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0xc00021ee40)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:462 +0x19d
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0xc00021ee40)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:401 +0x7f
github.com/dgraph-io/badger/v3/table.OpenTable(0xc000142260, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0xc000852200, 0x1, ...})
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:308 +0x272
github.com/dgraph-io/badger/v3.newLevelsController.func1({0xc0004a60c0, 0x13}, {0x0?, 0x0?, 0x0?})
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:150 +0x1f9
created by github.com/dgraph-io/badger/v3.newLevelsController
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:129 +0x585
Additional information
On Windows.
Our production DB is dead so often that we take a full backup every 5min.
Then attempt to recover and restore if needed.
Or keys are lost, and we don't know that.
Can't capture the error with recover(). How is that possible?