
Handle Lambda timeouts and PHP-FPM crashes better #862

Closed
aran112000 opened this issue Feb 10, 2021 · 13 comments

@aran112000
Contributor

One of our Lambda functions running in eu-west-2 started getting intermittent errors logged in CloudWatch stating:
"Error: signal: terminated Extension.Crash"

When this occurs, none of our application code is executed and the request just fails after ~2 seconds, with API Gateway then returning a 500 error for the affected requests.

Our function uses only three layers and has done so for months. The issues occurred for two hours on 2021-02-08, then again around the same time in the evening on 2021-02-09:

  • ${bref:layer.php-74-fpm}
  • ${bref:extra.memcached-php-74}
  • ${bref:extra.redis-php-74}

For clarity, there have been no changes to the application or environment on our side during these times, and we have no time-based logic that would account for the issues starting in the evening for a few hours on two consecutive days.

Not sure if this is a Bref / Bref Extra issue or an AWS one. The Lambda team have investigated and believe it is related to the layers, so any help or guidance to track things down further would be greatly appreciated.

aran112000 added the bug label Feb 10, 2021
@ghost

ghost commented Feb 10, 2021

We've managed to trace the issue to PHP-FPM itself, as can be seen from the log extract below. Has anyone encountered this before? For us it has happened on the last two nights only; it seems to eventually resolve itself, but it sporadically affected over 1k invocations.

[screenshot: extract of the PHP-FPM error logs]

@matthewsuan

@evoaran we are experiencing this issue as well. It fixes itself, I think, when Lambda spawns another micro-container instance.

@mnapoli
Member

mnapoli commented Feb 12, 2021

Hi! I haven't forgotten about this; I'm just short of ideas on what to do here. FYI, there's also #842, which could be related; it may be worth a look.

it seems to eventually resolve itself but affecting over 1k invocations sporadically.

Right, as @matthewsuan mentioned, this might be a case of the Lambda instance that is in a "broken" state being replaced by another Lambda instance.

Now the questions would be:

  • what causes Lambda/PHP-FPM to enter a broken state?
  • how can we improve Bref so that we never enter a broken state? (i.e. OK to have a failed request, but not the whole container)

Judging from #842 and past experience, could the initial failures be triggered by a large file upload or large HTTP response?

@aran112000
Contributor Author

Hi @mnapoli, @matthewsuan,

After combing through the logs again this evening, I think I've found the cause that starts it each time - a Lambda timeout:

Task timed out after 28.53 seconds

What I'm still not sure of, though, is what caused the initial timeout. It's just a normal HTTP request, and the services this example request depends on (Redis, Memcached & MySQL RDS) were all running perfectly at the time, with connection timeouts low enough that we'd never reach this limit.

To address your specific question, we're seeing a mix of GET and POST requests triggering this issue, and we don't process any file uploads via Lambda, so we can rule that out too. HTTP responses aren't "large" either: the largest affected responses are around 45 KB.

Could it just be that the Lambda host has died, perhaps?

On a slightly different topic, would you be happy to accept a PR changing the CloudWatch log format for Bref so that these FPM-related messages include a JSON payload with more context (the Lambda request ID and the API Gateway request ID, where applicable)? That would really help tie the various log entries together when issues like this occur.
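
For illustration, a structured entry like that could look roughly like the sketch below (the helper and field names are hypothetical, not existing Bref code):

// Hypothetical sketch only: the function and field names are illustrative, not part of Bref.
// CloudWatch treats each stderr line as one log event, so we emit one JSON document per line.
function logFpmEvent(string $message, ?string $lambdaRequestId, ?string $apiGatewayRequestId): void
{
    fwrite(STDERR, json_encode([
        'source' => 'bref.fpm',
        'message' => $message,
        'lambdaRequestId' => $lambdaRequestId,
        'apiGatewayRequestId' => $apiGatewayRequestId,
    ]) . PHP_EOL);
}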

@matthewsuan

I went and checked the logs, and these same errors happen during occasional timeouts (28 secs) as well, though I expect timeouts from time to time with my application.

I think when this happens, the FPM runtime is not able to shut down properly during the Shutdown phase, which is capped at 2 seconds before Lambda forces a SIGKILL.

@mnapoli
Member

mnapoli commented Feb 13, 2021

Thanks for those details! So we might need to adjust the current recovery mechanism in case of Lambda timeouts:

// Never seen this happen but just in case
if (! file_exists(self::PID_FILE)) {
    unlink(self::SOCKET);
    return;
}
$pid = (int) file_get_contents(self::PID_FILE);
// Never seen this happen but just in case
if ($pid <= 0) {
    echo "PHP-FPM's PID file contained an invalid PID, assuming PHP-FPM isn't running.\n";
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
// Check if the process is running
if (posix_getpgid($pid) === false) {
    // PHP-FPM is not running anymore, we can cleanup
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
// The PID could be reused by our new process: let's not kill ourselves
// See https://github.com/brefphp/bref/pull/645
if ($pid === posix_getpid()) {
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
echo "PHP-FPM seems to be running already. This might be because Lambda stopped the bootstrap process but didn't leave us an opportunity to stop PHP-FPM (did Lambda timeout?). Stopping PHP-FPM now to restart from a blank slate.\n";
// The previous PHP-FPM process is running, let's try to kill it properly
$result = posix_kill($pid, self::SIGTERM);
if ($result === false) {
    echo "PHP-FPM's PID file contained a PID that doesn't exist, assuming PHP-FPM isn't running.\n";
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
$this->waitUntilStopped($pid);
unlink(self::SOCKET);
unlink(self::PID_FILE);

Maybe we need to send a stronger "kill" signal in case we can't kill PHP-FPM here:

throw new Exception('Timeout while waiting for PHP-FPM to stop');

Any other idea?

On a slightly different topic, would you be happy to accept a PR changing the CloudWatch log format for Bref so that these FPM-related messages include a JSON payload with more context (the Lambda request ID and the API Gateway request ID, where applicable)? That would really help tie the various log entries together when issues like this occur.

Do you mean to change the format of all the logs written by PHP applications? Or specifically the logs written by Bref when managing PHP-FPM?

In short, I'm 👍 on improving Bref-related exceptions and logs. But changing the format of all application logs to JSON might be too big of a change.

@ghost

ghost commented Feb 13, 2021

Completely agree. Perhaps we could look at swapping out this...

$result = posix_kill($pid, self::SIGTERM);

...to use KILL (9) rather than TERM (15) to ensure it actually terminates regardless of what might be going on behind the scenes.
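
As a rough sketch of that escalation, building on the snippet above (this is not Bref's current behaviour; the literal 9 is used because the SIGKILL constant is only defined when the pcntl extension is loaded):

// Sketch only: try a graceful stop first, then force-kill if PHP-FPM doesn't exit in time.
$result = posix_kill($pid, self::SIGTERM);
if ($result !== false) {
    try {
        $this->waitUntilStopped($pid);
    } catch (Exception $e) {
        // PHP-FPM ignored SIGTERM or is stuck: escalate to SIGKILL (signal 9).
        posix_kill($pid, 9);
    }
}
unlink(self::SOCKET);
unlink(self::PID_FILE);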

Regarding the logs, I was purely thinking about the FPM-related logging to help track down what's going on here more easily, so I'll take a look at a PR to add this shortly.

@aran112000
Contributor Author

Do you think it might be worthwhile stipulating a timeout on FPM that's a little lower than the Lambda timeout in order to allow a little extra time for everything to wrap up successfully?

https://github.com/brefphp/bref/blob/master/src/Event/Http/FpmHandler.php#L79
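
For illustration, stock PHP-FPM already has a pool directive for this; something like the snippet below would cap each request a couple of seconds under a 28-second Lambda timeout (the value is only an example, and this is not Bref's shipped configuration):

; Example pool config only, not Bref's defaults:
; kill any request running longer than 26s so FPM can clean up
; before Lambda freezes or kills the container at 28s.
[www]
request_terminate_timeout = 26s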

@deleugpn
Member

deleugpn commented Feb 14, 2021

Pasting this here just because fiddling around with timeouts was mentioned, and it is extremely tricky:

#770
#772

mnapoli changed the title Error: signal: terminated Extension.Crash Handle Lambda timeouts and PHP-FPM crashes better Mar 25, 2021
@clarkeash

I think we are seeing this issue. Our uptime check fails (within 2 seconds), and this seems to occur when other requests have timed out, so I figure it's the same issue: PHP-FPM is in a bad state when the runtime is reused by our uptime ping request.

Has anyone been able to work around this issue?

@aran112000
Contributor Author

We still see it following a timeout (normally after RDS Serverless fails to scale).

mnapoli added a commit that referenced this issue Jan 4, 2022
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (it only affects FPM). Also, this does not show a full stack trace of the place in the code where the timeout happens (#895 did). Still, it's an improvement over the current status.
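
To make the timing concrete, a minimal sketch of the headroom calculation described above (the function name is illustrative, not Bref's actual code; the real change lives in the commit referenced here):

// Rough sketch only: compute a FastCGI read timeout that leaves
// about 1 second of headroom before the Lambda timeout.
function fpmTimeoutMs(int $remainingInvocationMs): int
{
    return max(1000, $remainingInvocationMs - 1000);
}

// Example: 28.5s left on the invocation means we wait at most 27.5s for PHP-FPM.
$timeout = fpmTimeoutMs(28500); // 27500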
@mnapoli
Member

mnapoli commented Jan 4, 2022

This PR may help to handle timeouts with FPM better, including flushing all logs (and possibly restarting the child FPM worker in a cleaner way): #1133

@mnapoli
Member

mnapoli commented Dec 1, 2022

Since it's been about a year since the last reports, and we have merged #1133 too, I'll go ahead and close this one.

mnapoli closed this as completed Dec 1, 2022
mnapoli added a commit that referenced this issue Feb 14, 2023