
Handle Lambda timeouts and PHP-FPM crashes better #862

Closed
aran112000 opened this issue Feb 10, 2021 · 13 comments

@aran112000
Contributor

One of our Lambda functions running in eu-west-2 started getting intermittent errors logged in CloudWatch stating:
"Error: signal: terminated Extension.Crash"

When this occurs, none of our application code is executed and the request just fails after ~2 seconds, with API Gateway then returning a 500 error for the affected requests.

Our function uses only three layers and has done so for months. The issues occurred for two hours on 2021-02-08, then again around the same time in the evening on 2021-02-09:

  • ${bref:layer.php-74-fpm}
  • ${bref:extra.memcached-php-74}
  • ${bref:extra.redis-php-74}

For clarity, there have been no changes to the application or environment on our side during these times, and we have no time-based logic that would account for the issues starting in the evening for a few hours on two consecutive days.

Not sure if this is a Bref / Bref Extra issue or an AWS one. The Lambda team have investigated and believe it is related to the layers, so any help or guidance to track things down further would be greatly appreciated.

aran112000 added the bug label Feb 10, 2021
@ghost

ghost commented Feb 10, 2021

We've managed to trace the issue to PHP-FPM itself, as can be seen from the log extract below. Has anyone encountered this before? For us it has happened on the last two nights only; it seems to eventually resolve itself, but it sporadically affected over 1k invocations.

[screenshot: extract of the PHP-FPM error logs]

@matthewsuan

@evoaran we are experiencing this issue as well. It fixes itself, I think, when Lambda spawns another micro-container instance.

@mnapoli
Member

mnapoli commented Feb 12, 2021

Hi! I haven't forgotten about this; I'm just short of ideas on what to do here. FYI, there's also #842, which could be related; it may be worth a look.

it seems to eventually resolve itself but affecting over 1k invocations sporadically.

Right, as @matthewsuan mentioned, this might be a case of the Lambda instance that is in a "broken" state being replaced by another Lambda instance.

Now the questions would be:

  • what causes Lambda/PHP-FPM to enter a broken state?
  • how can we improve Bref so that we never enter a broken state? (i.e. OK to have a failed request, but not the whole container)

Judging from #842 and past experience, could the initial failures be triggered by a large file upload or large HTTP response?

@aran112000
Contributor Author

Hi @mnapoli, @matthewsuan,

After combing through the logs again this evening, I think I've found the cause that starts it each time - a Lambda timeout:

Task timed out after 28.53 seconds

What I'm still not sure of, though, is what caused the initial timeout. It's just a normal HTTP request, and the services this example request depends on (Redis, Memcached & MySQL RDS) were all running perfectly at the time, with connection timeouts low enough that we'd never reach this limit.

To address your specific question, we're seeing a mix of GET and POST requests triggering this issue, and we don't process any file uploads via Lambda, so we can rule that out too. HTTP responses aren't "large" either: the largest affected responses are around 45 KB.

Could it just be that the Lambda host has died, perhaps?

On a slightly different topic, would you be happy to accept a PR changing the CloudWatch log format for Bref so that these FPM-related messages include a JSON payload with more context (the Lambda request ID and the API Gateway request ID, where applicable)? That would really help tie the various log entries together when issues like this occur.
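
For illustration, a structured entry like that could look roughly like the sketch below (the helper and field names are hypothetical, not existing Bref code):

// Hypothetical sketch only: the function and field names are illustrative, not part of Bref.
// CloudWatch treats each stderr line as one log event, so we emit one JSON document per line.
function logFpmEvent(string $message, ?string $lambdaRequestId, ?string $apiGatewayRequestId): void
{
    fwrite(STDERR, json_encode([
        'source' => 'bref.fpm',
        'message' => $message,
        'lambdaRequestId' => $lambdaRequestId,
        'apiGatewayRequestId' => $apiGatewayRequestId,
    ]) . PHP_EOL);
}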

@matthewsuan

I went and checked the logs, and these same errors happen during occasional timeouts (28 secs) as well, though I expect timeouts from time to time with my application.

I think when this happens, the FPM runtime is not able to shut down properly during the Shutdown phase, which is capped at 2 seconds before Lambda forces a SIGKILL.

@mnapoli
Member

mnapoli commented Feb 13, 2021

Thanks for those details! So we might need to adjust the current recovery mechanism in case of Lambda timeouts:

// Never seen this happen but just in case
if (! file_exists(self::PID_FILE)) {
    unlink(self::SOCKET);
    return;
}
$pid = (int) file_get_contents(self::PID_FILE);
// Never seen this happen but just in case
if ($pid <= 0) {
    echo "PHP-FPM's PID file contained an invalid PID, assuming PHP-FPM isn't running.\n";
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
// Check if the process is running
if (posix_getpgid($pid) === false) {
    // PHP-FPM is not running anymore, we can cleanup
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
// The PID could be reused by our new process: let's not kill ourselves
// See https://github.com/brefphp/bref/pull/645
if ($pid === posix_getpid()) {
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
echo "PHP-FPM seems to be running already. This might be because Lambda stopped the bootstrap process but didn't leave us an opportunity to stop PHP-FPM (did Lambda timeout?). Stopping PHP-FPM now to restart from a blank slate.\n";
// The previous PHP-FPM process is running, let's try to kill it properly
$result = posix_kill($pid, self::SIGTERM);
if ($result === false) {
    echo "PHP-FPM's PID file contained a PID that doesn't exist, assuming PHP-FPM isn't running.\n";
    unlink(self::SOCKET);
    unlink(self::PID_FILE);
    return;
}
$this->waitUntilStopped($pid);
unlink(self::SOCKET);
unlink(self::PID_FILE);

Maybe we need to send a stronger "kill" signal in case we can't kill PHP-FPM here:

throw new Exception('Timeout while waiting for PHP-FPM to stop');

Any other idea?

On a slightly different topic, would you be happy to accept a PR changing the CloudWatch log format for Bref so that these FPM-related messages include a JSON payload with more context (the Lambda request ID and the API Gateway request ID, where applicable)? That would really help tie the various log entries together when issues like this occur.

Do you mean to change the format of all the logs written by PHP applications? Or specifically the logs written by Bref when managing PHP-FPM?

In short, I'm 👍 on improving Bref-related exceptions and logs. But changing the format of all application logs to JSON might be too big of a change.

@ghost

ghost commented Feb 13, 2021

Completely agree. Perhaps we could look at swapping out this...

$result = posix_kill($pid, self::SIGTERM);

...to use KILL (9) rather than TERM (15) to ensure it actually terminates regardless of what might be going on behind the scenes.
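
As a rough sketch of that escalation, building on the snippet above (this is not Bref's current behaviour; the literal 9 is used because the SIGKILL constant is only defined when the pcntl extension is loaded):

// Sketch only: try a graceful stop first, then force-kill if PHP-FPM doesn't exit in time.
$result = posix_kill($pid, self::SIGTERM);
if ($result !== false) {
    try {
        $this->waitUntilStopped($pid);
    } catch (Exception $e) {
        // PHP-FPM ignored SIGTERM or is stuck: escalate to SIGKILL (signal 9).
        posix_kill($pid, 9);
    }
}
unlink(self::SOCKET);
unlink(self::PID_FILE);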

Regarding the logs, I was purely thinking about the FPM-related logging to help track down what's going on here more easily, so I'll take a look at a PR to add this shortly.

@aran112000
Contributor Author

Do you think it might be worthwhile stipulating a timeout on FPM that's a little lower than the Lambda timeout in order to allow a little extra time for everything to wrap up successfully?

https://github.com/brefphp/bref/blob/master/src/Event/Http/FpmHandler.php#L79
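
For illustration, stock PHP-FPM already has a pool directive for this; something like the snippet below would cap each request a couple of seconds under a 28-second Lambda timeout (the value is only an example, and this is not Bref's shipped configuration):

; Example pool config only, not Bref's defaults:
; kill any request running longer than 26s so FPM can clean up
; before Lambda freezes or kills the container at 28s.
[www]
request_terminate_timeout = 26s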

@deleugpn
Member

deleugpn commented Feb 14, 2021

Pasting this here just because fiddling around with timeouts was mentioned, and it is extremely tricky:

#770
#772

mnapoli changed the title Error: signal: terminated Extension.Crash Handle Lambda timeouts and PHP-FPM crashes better Mar 25, 2021
@clarkeash

I think we are seeing this issue. Our uptime check fails (within 2 seconds), and this seems to occur when other requests have timed out, so I figure it's the same issue: PHP-FPM is in a bad state when the runtime is reused by our uptime ping request.

Has anyone been able to work around this issue?

@aran112000
Contributor Author

We still see it following a timeout (normally after RDS Serverless fails to scale).

mnapoli added a commit that referenced this issue Jan 4, 2022
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (it only affects FPM). Also, this does not show a full stack trace of the place in the code where the timeout happens (#895 did). Still, it's an improvement over the current status.
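
To make the timing concrete, a minimal sketch of the headroom calculation described above (the function name is illustrative, not Bref's actual code; the real change lives in the commit referenced here):

// Rough sketch only: compute a FastCGI read timeout that leaves
// about 1 second of headroom before the Lambda timeout.
function fpmTimeoutMs(int $remainingInvocationMs): int
{
    return max(1000, $remainingInvocationMs - 1000);
}

// Example: 28.5s left on the invocation means we wait at most 27.5s for PHP-FPM.
$timeout = fpmTimeoutMs(28500); // 27500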
@mnapoli
Member

mnapoli commented Jan 4, 2022

This PR may help to handle timeouts with FPM better, including flushing all logs (and possibly restarting the child FPM worker in a cleaner way): #1133

@mnapoli
Member

mnapoli commented Dec 1, 2022

Since it's been about a year since the last reports, and we have merged #1133 too, I'll go ahead and close this one.

mnapoli closed this as completed Dec 1, 2022
mnapoli added a commit that referenced this issue Feb 14, 2023