Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"table or event-trigger not found in schema cache" error in logs #5461

Closed
hrgui opened this issue Jul 24, 2020 · 18 comments · Fixed by #6137
Closed

"table or event-trigger not found in schema cache" error in logs #5461

hrgui opened this issue Jul 24, 2020 · 18 comments · Fixed by #6137
Labels
p/high candidate for being included in the upcoming sprint
Milestone

Comments

@hrgui
Copy link

hrgui commented Jul 24, 2020

Just upgraded from 1.2.2 to 1.3.0, noticed this following error message popping up more than once:

graphql-engine {"type":"event-trigger","timestamp":"2020-07-24T23:37:43.617+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}

How do we go about debugging this error / what would be the cause? Hasura still functions normally, but the log message spams almost every second or minute.

@hrgui hrgui changed the title "table or event-trigger not found in schema cache" error shows up but does not mention what is not present in schema cache 1.3.0: "table or event-trigger not found in schema cache" error shows up but does not mention what is not present in schema cache Jul 24, 2020
@hrgui hrgui changed the title 1.3.0: "table or event-trigger not found in schema cache" error shows up but does not mention what is not present in schema cache "table or event-trigger not found in schema cache" error shows up but does not mention what is not present in schema cache Jul 25, 2020
@hrgui
Copy link
Author

hrgui commented Jul 25, 2020

Update:

DELETE FROM hdb_catalog.event_invocation_logs;

DELETE FROM hdb_catalog.event_log;

Query seems to help, I haven't seen the error messages anymore.

  • It seems a failed event-trigger (it has 3 retries) may cause such an issue to happen

@petecorreia
Copy link

petecorreia commented Jul 27, 2020

This has happened to us on multiple occasions specifically when we delete event triggers. Hasura seems to go crazy with it until we clear the event logs.

To diagnose I recommend the following:

select  trigger_name, archived, count(*) from hdb_catalog.event_log
group by trigger_name, archived;

There should be no event triggers listed that have been removed or look out of place.

If there are then clear them individually to avoid clearing out too much data that you might need from other event triggers. Replace <event_trigger_name>:

DELETE FROM hdb_catalog.event_invocation_logs
  WHERE event_id IN (
    SELECT id FROM hdb_catalog.event_log
    WHERE trigger_name = '<event_trigger_name>' );

DELETE FROM hdb_catalog.event_log
  WHERE trigger_name = '<event_trigger_name>';

These have fixed the issue for our case.

It'd be great for the Hasura team to understand what's really causing this but as you mentioned it seems related to infinite loops of retries or something.

@tirumaraiselvan
Copy link
Contributor

The error message definitely needs improvement.

This may happen when you have events in queue for an event trigger that you just deleted. But, it could also be some other race condition.

@tirumaraiselvan tirumaraiselvan added c/event-triggers p/high candidate for being included in the upcoming sprint labels Jul 29, 2020
@petecorreia
Copy link

@tirumaraiselvan Thank you clarifying!

Would you be looking at fixing the issue, improving the error message, or both?

We would have to prevent this from happening as Hasura creates a very high number of DB connections until it maxes out. We use Cloud SQL which limits and cuts off other valid connections. Ultimately this results in intermittent functionality failures which is affects app functionality.

We were looking at implementing a mitigation that detects the issue and then automatically clears out the affected event logs but we'll skip it if you're looking at a fix.

@tirumaraiselvan
Copy link
Contributor

tirumaraiselvan commented Jul 29, 2020

We would have to prevent this from happening as Hasura creates a very high number of DB connections until it maxes out. We use Cloud SQL which limits and cuts off other valid connections. Ultimately this results in intermittent functionality failures which is affects app functionality.

You can always limit the number of connections Hasura creates via HASURA_GRAPHQL_PG_CONNECTIONS (default: 50)

We will definitely investigate the cause of this error as part of improving the error message. Currently, it is not clear why it is happening. The only likely situation that I can think of is that the events were getting processed while the event trigger was dropped in which case it is rather benign.

@hrgui
Copy link
Author

hrgui commented Jul 29, 2020

This may happen when you have events in queue for an event trigger that you just deleted. But, it could also be some other race condition.

For me it was an event trigger that had retried 4 times and had failed all times.

@RodolfoSilva
Copy link
Contributor

RodolfoSilva commented Sep 7, 2020

@tirumaraiselvan this error still appears after updating to 1.3.2

2020-09-07T23:39:06.453708315Z {"type":"event-trigger","timestamp":"2020-09-07T23:39:06.086+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:39:07.454725277Z {"type":"event-trigger","timestamp":"2020-09-07T23:39:06.086+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:39:07.454784718Z {"type":"event-trigger","timestamp":"2020-09-07T23:39:06.086+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:40:06.592350211Z {"type":"event-trigger","timestamp":"2020-09-07T23:40:06.204+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:40:06.592429803Z {"type":"event-trigger","timestamp":"2020-09-07T23:40:06.204+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:40:07.593581597Z {"type":"event-trigger","timestamp":"2020-09-07T23:40:06.204+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:41:06.748399829Z {"type":"event-trigger","timestamp":"2020-09-07T23:41:06.327+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:41:06.748460254Z {"type":"event-trigger","timestamp":"2020-09-07T23:41:06.327+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:41:07.749948355Z {"type":"event-trigger","timestamp":"2020-09-07T23:41:06.327+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:42:06.893702941Z {"type":"event-trigger","timestamp":"2020-09-07T23:42:06.455+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:42:06.893762517Z {"type":"event-trigger","timestamp":"2020-09-07T23:42:06.455+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:42:07.895079037Z {"type":"event-trigger","timestamp":"2020-09-07T23:42:06.455+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:43:07.038807559Z {"type":"event-trigger","timestamp":"2020-09-07T23:43:06.597+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:43:07.038866520Z {"type":"event-trigger","timestamp":"2020-09-07T23:43:06.597+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:43:08.040231802Z {"type":"event-trigger","timestamp":"2020-09-07T23:43:06.597+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:44:07.180090676Z {"type":"event-trigger","timestamp":"2020-09-07T23:44:06.742+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:44:07.180130289Z {"type":"event-trigger","timestamp":"2020-09-07T23:44:06.742+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:44:08.181308236Z {"type":"event-trigger","timestamp":"2020-09-07T23:44:06.742+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:45:07.319800084Z {"type":"event-trigger","timestamp":"2020-09-07T23:45:06.867+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:45:07.319856215Z {"type":"event-trigger","timestamp":"2020-09-07T23:45:06.867+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:45:08.320483837Z {"type":"event-trigger","timestamp":"2020-09-07T23:45:06.867+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:46:07.526498516Z {"type":"event-trigger","timestamp":"2020-09-07T23:46:06.970+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:46:07.526577878Z {"type":"event-trigger","timestamp":"2020-09-07T23:46:06.970+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:46:08.528339850Z {"type":"event-trigger","timestamp":"2020-09-07T23:46:06.970+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:47:07.691373407Z {"type":"event-trigger","timestamp":"2020-09-07T23:47:07.107+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:47:07.691451154Z {"type":"event-trigger","timestamp":"2020-09-07T23:47:07.107+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:47:08.692429740Z {"type":"event-trigger","timestamp":"2020-09-07T23:47:07.107+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:48:07.846698088Z {"type":"event-trigger","timestamp":"2020-09-07T23:48:07.246+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:48:08.848335482Z {"type":"event-trigger","timestamp":"2020-09-07T23:48:07.246+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:48:08.848402382Z {"type":"event-trigger","timestamp":"2020-09-07T23:48:07.246+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:49:07.999542997Z {"type":"event-trigger","timestamp":"2020-09-07T23:49:07.387+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:49:09.000834633Z {"type":"event-trigger","timestamp":"2020-09-07T23:49:07.387+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:49:09.000902762Z {"type":"event-trigger","timestamp":"2020-09-07T23:49:07.387+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:50:08.172744735Z {"type":"event-trigger","timestamp":"2020-09-07T23:50:07.501+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:50:09.174826859Z {"type":"event-trigger","timestamp":"2020-09-07T23:50:07.501+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}
2020-09-07T23:50:09.174894688Z {"type":"event-trigger","timestamp":"2020-09-07T23:50:07.501+0000","level":"error","detail":{"path":"$","error":"table or event-trigger not found in schema cache","code":"unexpected"}}

@tirumaraiselvan
Copy link
Contributor

Unfortunately, the improved error msg in #5718 didn't get included in v1.3.2. We will definitely try to incorporate it in the next one.

Meanwhile, pls check the comment #5461 (comment) for getting to know the offending event trigger.

@petecorreia
Copy link

petecorreia commented Sep 30, 2020

@tirumaraiselvan We've noticed a more serious issue related to these errors. It's now clear that these errors are leading to an overload of DB connections from Hasura.

It can lead to failures in deployments as a new instance can struggle to connect to the DB, getting rejected due to max connections. We use kubernetes with replicated instances and rolling updates so we run into this often unfortunately.

You can see evidence of this in these two charts:

Screenshot 2020-09-30 at 14 38 44

Screenshot 2020-09-30 at 14 38 58

DB Connections match up perfectly to the amount of these errors per second.

The gap in the middle is when we cleared up the errors as mentioned in previous comment (#5461)

Improving the logs is definitely a step in the right direction but ideally this would be fixed?

@tirumaraiselvan
Copy link
Contributor

@petecorreia Thanks for adding more details. We are investigating this.

@jonlow
Copy link

jonlow commented Oct 3, 2020

Just flagging that this is happened to us to after an upgrade. The log file flooded with gigabytes of the same event-trigger error and crashed the server.

After clearing logs as per #5461 (comment) and manually deleting the log file things seem ok for now.

@smulube
Copy link

smulube commented Oct 5, 2020

Hi @tirumaraiselvan - I work with @petecorreia and I have a little more detail to this issue which might help you track things down; specifically I think there is perhaps two separate but related bugs here that combine to cause this issue.

Before detailing the two issues we found, a little context on our setup:

  • we have two Hasura replicas connected to the same Cloud SQL DB
  • we use the cli-migrations/v2 image when deploying updates to run migrations and to update metadata because we only enable the graphql API for security purposes.
  • our Hasura instances are currently logging ~30 error messages per second caused by invalid event triggers and maxing out available connections to the DB which causes other processes to fail

Issue 1 - Metadata not applying properly

On investigating our DB which is currently suffering from this issue via psql we noticed that a set of triggers that should no longer exist were still present on some tables despite no longer being in the metadata files. It's not clear to us how this has happened. The process we followed to remove the triggers was to remove them via the Hasura console, commit the changes to tables.yml and then deploy an updated docker image containing the updated metadata using the cli-migrations/v2 which theoretically should apply that metadata before starting Hasura.

It seems at some point this metadata application on deploy failed, and then once this had happened Hasura wasn't able to detect that the triggers had not been removed (I understand this may be difficult because there may also be user created triggers within the DB).

So at this point we have the unfortunate situation that Postgres still has triggers attached to a table, but from Hasura's perspective these triggers no longer exist.

Issue 2 - Event invocations for missing triggers does not update tries on the hdb_catalog.event_log table

When inspecting the event_log table, looking at some of the invalid calls I noticed the following issue: when Hasura sees an event for an invalid trigger it does not seem to update the tries field on the event_log table. This means that each event never hit the max_retries value so Hasura keeps trying these bogus events forever.

-------------------------------------------------------------------------------------------------------------
id            | 6fe105d7-085e-447c-ab15-5a7c7eee9125
schema_name   | public
table_name    | <redacted_table_name>
trigger_name  | <redacted-event-trigger-name>
payload       | {"op": "DELETE", "data": {...}": {"x-hasura-role": "admin"}}
delivered     | f
error         | f
tries         | 0
created_at    | 2020-10-05 13:00:22.745178
locked        | f
next_retry_at | 2020-10-05 15:50:23.507937
archived      | f

In this example entry from the hdb_catalog.event_table you can see that this event was created at 13:00 and Hasura has been retrying this event every minute since then, but tries remains at 0.

I also note that the error field is false when perhaps that should also be true.

Thanks in advance for your help

@hrgui
Copy link
Author

hrgui commented Oct 8, 2020

@smulube

Issue 2 - Event invocations for missing triggers does not update tries on the hdb_catalog.event_log table

Thanks for the insight, I also checked event_log table on my end and I have the same problem. The number of logs show up drastically with the number of events that failed to deliver.

@smulube
Copy link

smulube commented Oct 9, 2020

@hrgui - we have attempted to mitigate this by adding a manual migration to remove any invalid triggers along the lines of:

DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_UPDATE" ON table_name CASCADE;
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_INSERT" ON table_name CASCADE;
DROP TRIGGER IF EXISTS "notify_hasura_table_name-change_DELETE" ON table_name CASCADE;

where those trigger names come from using psql to inspect each table and identify any triggers that should no longer exist.

We also added this function to delete invalid triggers:

CREATE OR REPLACE FUNCTION purge_invalid_events() RETURNS void AS $$
BEGIN
    WITH valid_trigger_names AS (
        SELECT DISTINCT name FROM hdb_catalog.event_triggers
    ), invalid_event_ids AS (
        SELECT id FROM hdb_catalog.event_log
        WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names)
    ), delete_event_invocations AS (
        DELETE FROM hdb_catalog.event_invocation_logs
        WHERE event_id IN (SELECT * FROM invalid_event_ids)
    )
    DELETE FROM hdb_catalog.event_log
    WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names);
END;
$$ LANGUAGE plpgsql;

which we call daily via a cron job. We're hoping this will mitigate the problem for the time being but hopefully there will be some cleaner way of fixing this than this workaround as we'll continue to need to check the database for any dangling invalid triggers.

@tirumaraiselvan
Copy link
Contributor

I have been able to reproduce this, finally. Thanks for your details and patience.

Issue 2 - Event invocations for missing triggers does not update tries on the hdb_catalog.event_log table

This was done by design so as to not fail silently for an event which hitherto was known to be caused only for transient issues. We will see how we can improve this as well.

@tirumaraiselvan
Copy link
Contributor

Basically, replace_metadata is broken when dropping event triggers. Repro:

  1. create table test(id serial primary key, name text);
  2. Initial metadata:
{ "type": "replace_metadata",
  "args":
{
  "version": 2,
  "tables": [
    {
      "table": {
        "schema": "public",
        "name": "test"
      },
      "event_triggers": [
        {
          "name": "test_trigger",
          "definition": {
            "enable_manual": false,
            "insert": {
              "columns": "*"
            }
          },
          "retry_conf": {
            "num_retries": 0,
            "interval_sec": 10,
            "timeout_sec": 60
          },
          "webhook": "http://httpbin.org/post"
        }
      ]
    }
  ]
}
}
  1. Final metadata:
{ "type": "replace_metadata",
  "args":
{
  "version": 2,
  "tables": [
    {
      "table": {
        "schema": "public",
        "name": "test"
      }
  }
  ]
}
}
  1. If you insert into test, then you can observe the errors. You can also see the trigger functions still exist in hdb_views schema.

@tirumaraiselvan tirumaraiselvan added this to the v1.3.3 milestone Nov 3, 2020
@tirumaraiselvan tirumaraiselvan changed the title "table or event-trigger not found in schema cache" error shows up but does not mention what is not present in schema cache "table or event-trigger not found in schema cache" error in logs Nov 4, 2020
lexi-lambda added a commit to lexi-lambda/graphql-engine that referenced this issue Nov 5, 2020
lexi-lambda added a commit to lexi-lambda/graphql-engine that referenced this issue Nov 5, 2020
tirumaraiselvan added a commit to lexi-lambda/graphql-engine that referenced this issue Nov 10, 2020
kodiakhq bot added a commit to lexi-lambda/graphql-engine that referenced this issue Nov 10, 2020
@kodiakhq kodiakhq bot closed this as completed in #6137 Nov 10, 2020
@tirumaraiselvan
Copy link
Contributor

tirumaraiselvan commented Nov 12, 2020

Hey folks, while we roll out a fix for this in the next few days (v1.3.3). You can run this query to clear any invalid postgres triggers:

curl -d'{"type": "run_sql", "args": { "sql": "select 1", "check_metadata_consistency": true}}' localhost:8080/v1/query

Note that you may need to clear the events which might have been created due to these invalid triggers:

    WITH valid_trigger_names AS (
        SELECT DISTINCT name FROM hdb_catalog.event_triggers
    ), invalid_event_ids AS (
        SELECT id FROM hdb_catalog.event_log
        WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names)
    ), delete_event_invocations AS (
        DELETE FROM hdb_catalog.event_invocation_logs
        WHERE event_id IN (SELECT * FROM invalid_event_ids)
    )
    DELETE FROM hdb_catalog.event_log
    WHERE trigger_name NOT IN (SELECT * FROM valid_trigger_names);

Courtesy: #5461 (comment)

Another way is to run this SQL, before any metadata apply:

CREATE TEMP TABLE invalid_triggers AS
WITH valid_event_triggers AS (
  SELECT name FROM hdb_catalog.event_triggers
),
  archive_invalid_events AS (
  UPDATE hdb_catalog.event_log set archived = 't'
  WHERE trigger_name NOT IN (select name from valid_event_triggers)
)
SELECT routine_name FROM information_schema.routines
WHERE routine_type='FUNCTION' AND specific_schema='hdb_views' AND routine_name NOT IN (
  SELECT 'notify_hasura_' || name || '_INSERT' FROM valid_event_triggers
  UNION
  SELECT 'notify_hasura_' || name || '_UPDATE' FROM valid_event_triggers
  UNION
  select 'notify_hasura_' || name || '_DELETE' FROM valid_event_triggers
);

DO $$ DECLARE
r RECORD;
BEGIN
  FOR r IN (SELECT routine_name from invalid_triggers) LOOP
    EXECUTE 'DROP FUNCTION IF EXISTS hdb_views.' || quote_ident(r.routine_name) || ' CASCADE';
  END LOOP;
END $$;

@dariocravero
Copy link
Contributor

dariocravero commented May 24, 2021

Hey @tirumaraiselvan this seems to be back in v2.0.0-alpha.10. The error is slightly different though could not find cron trigger in cache, but it happens after applying a metadata change (+reload) that removes a scheduled task. Seems related to #6733.

In case someone finds them handy, we used these SQL bits to clean the unused crons:

-- see crons that aren't in the metadata
select hc.id, hc.trigger_name, hc.status, hc.scheduled_time 
from hdb_catalog.hdb_cron_events hc
where hc.status = 'scheduled' and hc.trigger_name not in (
 select (hm.item->>'name')::text from (
   select json_array_elements(metadata->'cron_triggers') as item
   from hdb_catalog.hdb_metadata
 ) hm
)
order by hc.scheduled_time asc;


-- delete crons that aren't in the metadata
delete from hdb_catalog.hdb_cron_events hc
where hc.status = 'scheduled' and hc.trigger_name not in (
 select (hm.item->>'name')::text from (
   select json_array_elements(metadata->'cron_triggers') as item
   from hdb_catalog.hdb_metadata
 ) hm
);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p/high candidate for being included in the upcoming sprint
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants