Cannot Start Redpanda After Upgrading from 22.2.7 to 22.3.2 (or 22.3.1) #7343
Comments
Just to be clear, since that bug report is quite long:
@kargh : Thanks for the report, we are looking into it. Also, presumably this is a 1-node cluster -- is that right?
It is a 10 node cluster.
I didn't have much of a choice as far as downgrading; I wanted to get the node back online.
On Wed, Nov 16, 2022, 5:35 PM piyushredpanda wrote:
@kargh : Thanks for the report, we are looking into it.
To clarify though, we do not support a downgrade.
Also, presumably this is a 1-node cluster -- is that right?
@piyushredpanda - I don't think that matters because the cluster never actually upgraded, so the global barrier shouldn't come into play.
@emaxerrno, yes, @kargh also mentioned they downgraded the cluster, which is what I was trying to clarify we do not support, for future reference.
Just to clarify the signals here, @kargh do you know roughly when you started upgrading the first node? Looks like Nov 16 15:35:47 is when things started going bad with the upgrade, but it looks like there was a bit more segfaulting happening a few days earlier. How many topics and partitions are running on this cluster? How long was this node down before being upgraded? Do you see similar symptoms on other nodes? Also, do you happen to have the logs starting from before the node started going bad?
@kargh - for the errors you show immediately after you write "From dmesg:": on what redpanda version did those occur?
@travisdowns Yeah, looking at the timestamps, some of those are from prior to the upgrade. That would have been on v22.2.7. The opcode errors would have been from v22.3.2.
@andrwng That error message happened right after I upgraded the node and tried to start it -- within 5 minutes of running the yum update command. The original issue Alex was helping me with was:
He had suggested I upgrade to 22.3.2 because a memory issue bug was fixed recently. The cluster has 53 topics. The node wasn't down when I upgraded. I see the above warning (seastar - Exceptional future ignored) on other nodes; that is what I was trying to resolve across the cluster. I just have the journal logs from the other boxes. I've not tried to upgrade them at this point; I prefer to resolve this problem first before encountering it on other nodes.
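(In case it helps with gathering those journal logs, a rough sketch of exporting a node's Redpanda journal for the relevant window is below; the unit name "redpanda" and the exact timestamps are assumptions, adjust as needed.)

```
# Export the systemd journal for the redpanda unit around the upgrade window.
# The unit name and the time range are assumptions; adjust to the node in question.
journalctl -u redpanda --since "2022-11-14 00:00:00" --until "2022-11-16 16:00:00" \
  --no-pager > redpanda-node0-journal.log
```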
Upgraded to 22.3.3 and here is the full log from when I started the service until it failed:
Also, here are the logs from 22.2.7 from when I issued the
Is it safe to do a yum uninstall then reinstall? I assume the config files and such will remain in place.
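A minimal sketch of what a remove-and-reinstall could look like, assuming the default config path /etc/redpanda/redpanda.yaml and yum version pinning; whether the config survives removal depends on how the package marks it, so the backup step is the important part:

```
# Back up the node config first (default path assumed).
sudo cp /etc/redpanda/redpanda.yaml /root/redpanda.yaml.bak

# Remove and reinstall a pinned version. The data directory under
# /var/lib/redpanda should not be touched by the package manager.
sudo yum remove -y redpanda
sudo yum install -y redpanda-22.3.2

# Confirm the config is still what you expect before starting the service.
sudo diff /root/redpanda.yaml.bak /etc/redpanda/redpanda.yaml
```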
Now the entire cluster is unstable (I've not changed anything). This is in dmesg:
@kargh - this
@kargh can you decommission one node since the cluster is unstable and run an fio test on the same device to get a hardware profile (16KiB pages, sequential read/write, 10G file)? Wondering what the fio latencies look like.
@emaxerrno Can you provide me quick and dirty instructions on that? It has been a long time since I played around with that.
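A minimal sketch of fio invocations matching the profile @emaxerrno described (16KiB blocks, sequential write then read, 10G file); the file path, libaio engine, and queue depth are assumptions, not the exact command that ended up being used:

```
# Sequential write pass: 16KiB blocks, 10G file, direct I/O on the data device.
fio --name=seqwrite --filename=/var/lib/redpanda/fio-test.bin \
    --rw=write --bs=16k --size=10G --ioengine=libaio --direct=1 \
    --iodepth=16 --group_reporting

# Sequential read pass over the same file.
fio --name=seqread --filename=/var/lib/redpanda/fio-test.bin \
    --rw=read --bs=16k --size=10G --ioengine=libaio --direct=1 \
    --iodepth=16 --group_reporting
```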
@emaxerrno I found the stuff I need for the fio test.
@emaxerrno Running this. Let me know if you want a specific engine or anything. I can restart it with different parameters:
The remaining nodes in the cluster appear to have stabilized -- at least they have stopped crashing and restarting. We have noticed that the last two times we have pushed out updates to our application, it has had connection issues with RP. The app does a reload which creates new connections, I believe, to all the services. It appears the crash/restart cycle might have started when the update was pushed out. Maybe the nodes are getting smashed with connection requests. We'll be investigating that on our end.
@kargh thanks for all these notes... this is even MORE suspicious if that is the case. Hmmm... any insight on the app behavior? BTW the eng team has today and tomorrow off, so they'll be online next week.
This is definitely a bug w/ redpanda @kargh - any 'demo' app we can run on our CI to test? It seems like the app is able to trigger a bug somewhere: maybe compression buffers, maybe large payloads... anything you can do to help us get to the bottom of it would be helpful.
Last q @kargh: what is the CPU-to-memory ratio, is it at least 2G/core?
@emaxerrno 4G/core. 64 cores (logical processors are disabled). 256GB memory per node. What is odd is that the connection issue persists until we reload the app a second time; then everything seems to run fine. It might just be the volume of servers that get updated that is causing the issue. That said, it looks like all the nodes might have crashed at the same time. Looking at the memory graphs, they all go from about 75% usage to 25% usage at 14:02. The app release was finished at 13:52. As far as a demo app, I'm not really sure what I could provide. I'll have to find out. The problem is that it wouldn't actually run without all the required services. I'm hoping it remains stable over the weekend or even through next week.
The app does create new connections. The old processes and their connections die off and new ones are spawned. We've recently moved from a lot of really small bidders (2x10, 2x14, 2x18 core boxes) to a large number of 1x64 core boxes that handle a much larger volume of traffic. During our updates we were updating 15 servers at a time. It is possible that the volume of newly created connections was too much. We are going to change it from 15 to 5. The thing is, I'm only seeing about 170 connections per RP node from a single bidder. Even if 15 of them kill and create new connections at a time, that shouldn't be enough to cause an issue or instability.
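One rough way to watch whether a node really is getting smashed with connection requests during a rollout, assuming the default Kafka listener port 9092:

```
# Count established connections to the Kafka listener (port 9092 assumed),
# refreshed every 2 seconds while the bidders are being updated.
watch -n 2 "ss -tn | grep -c ':9092 '"
```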
This sounds ideal tbh, so definitely a bug on our side. Let us dig next week.
Just a bit more info. The application is just a nodejs app using node-rdkafka to handle kafka duties, on nodejs v16.15.0. So nothing home baked for connection management or communicating with the cluster.
@kargh - can you provide the exact version of redpanda used when this output occurred? Sorry, there are a few different versions involved, so I just want to be clear, since to decode the backtrace we need version-specific debug info.
Based on the backtrace you provided in slack, you were seeing OOMs due to #6854, which is fixed in 22.3. So that's a good reason to upgrade. Can you tell us a bit about the topic structure? E.g., are topics created and deleted often? Does it happen that a topic receives a lot of traffic for a period then later almost no traffic (but is not deleted)? Those would lead to the type of OOM you saw.
@travisdowns Topics are rarely created and never deleted. There are some topics that receive a bunch of traffic once or twice a week (this is something new that they added recently). I would like to upgrade, but unfortunately RP won't start when I try.
22.2.7 |
@travisdowns Just so we are on the same page, what is your definition of 'a lot of traffic'?
BTW, I know that 22.3.2 will run on these servers. I built a 3-node cluster in another DC on Tuesday with that version. Granted, it is not in production and not taking any traffic yet, but RP did start and I have created all the topics. So while there could be some issues with the AMD CPU (which wouldn't surprise me considering I've run into other issues with these chips), it is not likely the sole issue that I'm having with the production cluster.
Another thing of note: the server I was upgrading was node 0. It is the seed server for the rest of the cluster. I'm not sure if that has any bearing on the upgrade issues I'm experiencing.
Looks like #7351 might fix the 22.3.1 upgrade issue. I'll be ready to test that as soon as it is merged.
About 10 GB or more. I.e., if those topics which only receive traffic once a week or so get more than that (even added up over several weeks), it could trigger the out-of-memory issue you saw.
It may. It definitely causes a similar problem. However, since it is a race condition it usually only triggers intermittently, but in your case I understood that you see a failure to start every time. It is possible that the race condition simply occurs every time (especially since it seems like you have 64 cores, increasing the chance of this occurring), but it is also possible that it's a different issue related to the state of your existing logs, one which didn't trigger an assert previously (but still existed) but which now does.
Yup, I got it: just wanted to provide some background on the OOM issue which drove the upgrade attempt in the first place.
Ah, nothing close to that. Jumped to about 650MB total for the entire cluster.
Only one way to find out! FYI - It is always the same assert message:
Always core 27 in domain 0 at node { node: 8...
@kargh is there only a single common host being used here, or is the "upgrade" happening on a different host, possibly with a different number of cores?
@travisdowns the upgrade is only happening on the same host (node 0). I've not tried upgrading any other host. All hosts are identical servers. Same number of cores, memory, drives, etc.
And 22.2.7 is crashing again.
Every node is now in an endless crash loop. The logs are getting spammed with things such as this:
Just an FYI - I had to blow away the cluster and all data and start over on 22.3.3. All the nodes started crashing Saturday evening. I wasn't able to keep any of them up for more than 15 seconds. Hopefully whatever perfect storm of issues hit us doesn't happen with 22.3.3.
v22.3.4 had two fixes for this area of code, so I think we can close this one out.
Version & Environment
Redpanda version (use rpk version): Current: 22.2.7. Upgraded to 22.3.2, then downgraded to 22.3.1; both have issues.
OS: CentOS 7.9
Kernel: 5.19.8-1.el7.elrepo.x86_64
What went wrong?
After upgrading to 22.3.2, I observed the following errors in the journal. App would not start:
I then downgraded from 22.3.2 to 22.3.1 and attempted to start the app again. This was in the log:
Downgraded again, back to 22.2.7, and while that started, I did see this in the log:
What should have happened instead?
Application should start after the upgrade.
How to reproduce the issue?
Additional information
I suspect you won't see the same issues that I am experiencing, but hopefully there are enough details here to help track it down. I was asked to upgrade because I was observing the following issues originally:
From dmesg: