Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NullReferenceException in ExecuteScript Method When Executing Lua Scripts #635

Open
danielchristianschroeter opened this issue Sep 4, 2024 · 29 comments · May be fixed by #672
Open

NullReferenceException in ExecuteScript Method When Executing Lua Scripts #635

danielchristianschroeter opened this issue Sep 4, 2024 · 29 comments · May be fixed by #672

Comments

@danielchristianschroeter

Describe the bug

When attempting to execute Lua scripts using GarnetServer, a NullReferenceException is thrown in the ExecuteScript method. This issue occurs when using the EVAL or EVALSHA commands.

A NullReferenceException is thrown, indicating that an object reference is not set to an instance of an object.

Error Message:

2024-09-04 18:28:21 04::28::21 fail: Session[0] [192.168.80.2:6379] [192.168.80.1:42348] [01400C58] Error executing Lua script callback
2024-09-04 18:28:21 04::28::21 crit: Session[0] [192.168.80.2:6379] [192.168.80.1:42348] [01400C58] ProcessMessages threw an exception: System.NullReferenceException: Object reference not set to an instance of an object.    at Garnet.server.RespServerSession.ExecuteScript(Int32 count, LuaRunner scriptRunner) in /src/libs/server/Lua/LuaCommands.cs:line 235    at Garnet.server.RespServerSession.TryEVAL() in /src/libs/server/Lua/LuaCommands.cs:line 97    at Garnet.server.RespServerSession.ProcessArrayCommands[TGarnetApi](RespCommand cmd, TGarnetApi& storageApi) in /src/libs/server/Resp/RespServerSession.cs:line 664    at Garnet.server.RespServerSession.ProcessMessages() in /src/libs/server/Resp/RespServerSession.cs:line 420    at Garnet.server.RespServerSession.TryConsumeMessages(Byte* reqBuffer, Int32 bytesReceived) in /src/libs/server/Resp/RespServerSession.cs:line 299

The LuaRunner object might not be properly initialized or retrieved, leading to a null reference when attempting to execute the script.

This report is related to the Lua implementation in #550

Steps to reproduce the bug

  1. Enable Lua scripting in GarnetServer.
  2. Attempt to execute a Lua script using the EVAL or EVALSHA command.
    Go Code Using RedisLock from https://github.com/bsm/redislock/blob/main/redislock.go:
func (rs *RedisStorage) Lock(ctx context.Context, name string) error {

	key := rs.prefixLock(name)

	for {
		// try to obtain lock
		lock, err := rs.locker.Obtain(ctx, key, lockTTL, &redislock.Options{})

		// lock successfully obtained
		if err == nil {
			// store the lock in sync.map, needed for unlocking
			rs.locks.Store(key, lock)
			// keep the lock fresh as long as we hold it
			go func(ctx context.Context, lock *redislock.Lock) {
				for {
					// refresh the Redis lock
					err := lock.Refresh(ctx, lockTTL, nil)
					if err != nil {
						return
					}

					select {
					case <-time.After(lockRefreshInterval):
					case <-ctx.Done():
						return
					}
				}
			}(ctx, lock)

			return nil
		}

		// check for unexpected error
		if err != redislock.ErrNotObtained {
			return fmt.Errorf("Unable to obtain lock for %s: %v", key, err)
		}

		// lock already exists, wait and try again until cancelled
		select {
		case <-time.After(lockPollInterval):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

Related Lua scripts:
https://github.com/bsm/redislock/blob/main/obtain.lua
https://github.com/bsm/redislock/blob/main/refresh.lua

  1. Observe the NullReferenceException in the logs of garnet server.

Expected behavior

The Lua script should execute without throwing a NullReferenceException.

Screenshots

No response

Release version

v1.0.19

IDE

No response

OS version

No response

Additional context

Docker Compose service configuration:
docker-compose.yml

services:
  garnet:
    image: 'ghcr.io/microsoft/garnet:1.0.19'
    ulimits:
      memlock: -1
    ports:
      - "6379:6379"
    volumes:
      - garnetdata:/data
      - ./garnet.conf:/app/garnet.conf
    command: ["--config-import-path", "/app/garnet.conf"]

garnet.conf

{
	/* Enable Lua scripts on server. */
	"EnableLua" : true,

	/* Run Lua scripts as a transaction (lock keys - run script - unlock keys). */
	"LuaTransactionMode" : true,
}
@mmclure-msft
Copy link
Contributor

I'm seeing the same error when trying to use Redisson maps. Dug down to the following line in LuaCommands.cs:

if (storeWrapper.storeScriptCache.TryGetValue(d, out var source))

which does not return a value. Looking at storeScriptCache shows it to be empty:
image
So it looks like somehow the script isn't properly getting stored.

@mmclure-msft
Copy link
Contributor

Digging into my case further, it is correct that there is no value returned. If I run with Redis I get a NOSCRIPT error back on the same execution. So I may be running into something different.

@mmclure-msft
Copy link
Contributor

Looks like this is at least part of the problem:

while (!RespWriteUtils.WriteError("ERR " + ex.InnerException.Message, ref dcurr, dend))

ex.innerException is null.
Changing that line to
while (!RespWriteUtils.WriteError("ERR " + (ex.InnerException ?? ex).Message, ref dcurr, dend))
fixes the crash and forwards the error on to the client.

@badrishc
Copy link
Contributor

badrishc commented Sep 5, 2024

The fix LGTM, if you can create a PR that would be great.

@badrishc
Copy link
Contributor

badrishc commented Sep 5, 2024

@danielchristianschroeter - please check whether this change unblocks your example.

@mmclure-msft
Copy link
Contributor

Created #642

@danielchristianschroeter
Copy link
Author

@mmclure-msft
Thanks to your suggested changes, the NullReferenceException is resolved. However, I’m still encountering "Error executing Lua script callback". Any chance to get more information?

2024-09-05 17:11:55 03::11::55 fail: Session[0] [172.22.0.1:37528] [005C39D4] Error executing Lua script callback

@mmclure-msft
Copy link
Contributor

I'm seeing similar issues. I think there may be subtle (or not so subtle) differences in how NLua parses the scripts vs. the library that's used by Redis. Setting --logger-level Trace on the command line may help get more details on what's failing - for example I'm seeing

08::32::03 fail: Session[0] [10.0.0.123:6379] [127.0.0.1:64740] [004B8E6D] Error executing Lua script System.IndexOutOfRangeException: Index was outside the bounds of the array.    at System.Array.GetFlattenedIndex(Int32 rawIndex)    at System.Array.GetValue(Int32 index)    at NLua.MetaFunctions.TryAccessByArray(Lua luaState, Type objType, Object obj, Object index)    at NLua.MetaFunctions.GetMethodInternal(Lua luaState)    at NLua.MetaFunctions.GetMethod(IntPtr state)    at KeraLua.NativeMethods.lua_pcallk(IntPtr luaState, Int32 nargs, Int32 nresults, Int32 errorfunc, IntPtr ctx, IntPtr k)    at KeraLua.Lua.PCall(Int32 arguments, Int32 results, Int32 errorFunctionIndex)    at NLua.Lua.CallFunction(Object function, Object[] args, Type[] returnTypes)    at NLua.Lua.CallFunction(Object function, Object[] args)    at NLua.LuaFunction.Call(Object[] args)    at Garnet.server.LuaRunner.RunInternal(String[] keys, String[] argv) in C:\src\garnet\libs\server\Lua\LuaRunner.cs:line 374    at Garnet.server.LuaRunner.RunTransactionInternal(String[] keys, String[] argv) in C:\src\garnet\libs\server\Lua\LuaRunner.cs:line 323    at Garnet.server.LuaRunner.Run(Int32 count, SessionParseState parseState) in C:\src\garnet\libs\server\Lua\LuaRunner.cs:line 281    at Garnet.server.RespServerSession.ExecuteScript(Int32 count, LuaRunner scriptRunner) in C:\src\garnet\libs\server\Lua\LuaCommands.cs:line 196

But I think it might be best to track this under a new issue once we have more details.

@danielchristianschroeter
Copy link
Author

Thank you!

I already tried setting the logger level to Trace, but I still don’t get additional information in my case. All I receive is:

fail: Session[0] [192.168.0.1:51102] [005818D1] Error executing Lua script callback

Without more details, it’s hard to debug. But of course, we can close this issue if the PR is merged.

@danielchristianschroeter
Copy link
Author

After adding detailed logging to the ExecuteScript function within libs/server/Lua/LuaCommands.cs, I identified that the Lua script is attempting to call a function named status_reply, which is not defined.

Log Output:

2024-09-05 17:49:16 03::49::16 info: Session[0] [192.168.48.1:41822] [0283D742] Executing Lua script with count: 5
2024-09-05 17:49:16 03::49::16 info: Session[0] [192.168.48.1:41822] [0283D742] Lua script input: Garnet.server.SessionParseState
2024-09-05 17:49:16 03::49::16 fail: Session[0] [192.168.48.1:41822] [0283D742] Error executing Lua script callback. Inner Exception: (null) NLua.Exceptions.LuaScriptException: [string "..."]:2: attempt to call a nil value (field 'status_reply')

Relevant Code:

private unsafe bool ExecuteScript(int count, LuaRunner scriptRunner)

private unsafe bool ExecuteScript(int count, LuaRunner scriptRunner)
{
    try
    {
        logger?.LogInformation("Executing Lua script with count: {count}", count);
        logger?.LogInformation("Lua script input: {parseState}", parseState);

        object scriptResult = scriptRunner.Run(count, parseState);
        logger?.LogInformation("Lua script executed successfully. Result: {scriptResult}", scriptResult);

        if (scriptResult != null)
        {
            if (scriptResult is string s)
            {
                while (!RespWriteUtils.WriteAsciiBulkString(s, ref dcurr, dend))
                    SendAndReset();
            }
            else if ((scriptResult as byte?) != null && (byte)scriptResult == 36) //equals to $
            {
                while (!RespWriteUtils.WriteDirect((byte[])scriptResult, ref dcurr, dend))
                    SendAndReset();
            }
            else if (scriptResult as Int64? != null)
            {
                while (!RespWriteUtils.WriteInteger((Int64)scriptResult, ref dcurr, dend))
                    SendAndReset();
            }
            else if (scriptResult as ArgSlice? != null)
            {
                while (!RespWriteUtils.WriteBulkString(((ArgSlice)scriptResult).ToArray(), ref dcurr, dend))
                    SendAndReset();
            }
            else if (scriptResult as Object[] != null)
            {
                // Two objects one boolean value and the result from the Lua Call
                while (!RespWriteUtils.WriteAsciiBulkString((scriptResult as Object[])[1].ToString().AsSpan(), ref dcurr, dend))
                    SendAndReset();
            }
        }
        else
        {
            while (!RespWriteUtils.WriteDirect(CmdStrings.RESP_ERRNOTFOUND, ref dcurr, dend))
                SendAndReset();
        }
    }
    catch (LuaScriptException ex)
    {
        logger?.LogError(ex, "Error executing Lua script callback. Inner Exception: {InnerException}", ex.InnerException);
        while (!RespWriteUtils.WriteError("ERR " + (ex.InnerException ?? ex).Message, ref dcurr, dend))
            SendAndReset();
        return true;
    }
    catch (Exception ex)
    {
        logger?.LogError(ex, "Error executing Lua script");
        while (!RespWriteUtils.WriteError("ERR " + ex.Message, ref dcurr, dend))
            SendAndReset();
        return true;
    }
    return true;
}

@mmclure-msft
Copy link
Contributor

I think the main issue is that much of the API documented at https://redis.io/docs/latest/develop/interact/programmability/lua-api/ is not yet implemented. It will probably require community involvement to add the functionality to bring Garnet Lua scripting to parity with Redis. Definitely not a couple of small PRs 😄

@mmclure-msft
Copy link
Contributor

mmclure-msft commented Sep 5, 2024

@danielchristianschroeter can you see if mmclure-msft@b357a63 fixes your status_reply error?
EDIT: Never mind - there seems to be some issue with returning a table - this needs more investigation.

@danielchristianschroeter
Copy link
Author

danielchristianschroeter commented Sep 5, 2024

@mmclure-msft
Thank you for your assistance! Your suggestions were very helpful.

Log Output:
info: Session[0] [172.24.0.1:40662] [027B1BBF] Lua script executed successfully. Result: table

Even though the Lua script executes successfully, my code still fails with ErrNotObtained and Unable to obtain lock: context deadline exceeded. This happens because the result is returned as a table instead of a string, which causes compatibility issues with the original Redis-based code.

To fix this, in addition to your suggested changes, I modified the code in libs/server/Lua/LuaRunner.cs to ensure the Lua script returns a string, as required for compatibility with the original Redis code (referencing this line).

Here’s the quick and dirty test approach I used:

        object RunInternal(string[] keys, string[] argv)
        {
            if (keys != this.keys)
            {
                if (keys == null)
                {
                    this.keys = null;
                    sandbox_env["KEYS"] = this.keys;
                }
                else
                {
                    if (this.keys != null && keys.Length == this.keys.Length)
                        Array.Copy(keys, this.keys, keys.Length);
                    else
                    {
                        this.keys = keys;
                        sandbox_env["KEYS"] = this.keys;
                    }
                }
            }
            if (argv != this.argv)
            {
                if (argv == null)
                {
                    this.argv = null;
                    sandbox_env["ARGV"] = this.argv;
                }
                else
                {
                    if (this.argv != null && argv.Length == this.argv.Length)
                        Array.Copy(argv, this.argv, argv.Length);
                    else
                    {
                        this.argv = argv;
                        sandbox_env["ARGV"] = this.argv;
                    }
                }
            }

            var result = function.Call();
            return ProcessLuaResult(result);
        }

        object ProcessLuaResult(object[] result)
        {
            if (result.Length > 0)
            {
                var firstResult = result[0];
                if (firstResult is LuaTable luaTable)
                {
                    if (luaTable["ok"] != null)
                    {
                        return luaTable["ok"].ToString();
                    }
                }
                return firstResult;
            }
            return null;
        }

        /// <summary>
        /// Runs the precompiled Lua function
        /// </summary>
        /// <returns></returns>
        public object Run()
        {
            var result = function.Call();
            return ProcessLuaResult(result);
        }

Log Output After Changes:
info: Session[0] [172.29.0.1:60954] [037A9C05] Lua script executed successfully. Result: 1

After applying these changes, the code now executes successfully.

@mmclure-msft
Copy link
Contributor

Interesting. Per https://redis.io/docs/latest/develop/interact/programmability/lua-api it's supposed to return a table - I'll need to experiment with real Redis and see what happens.

@danielchristianschroeter
Copy link
Author

I see your point, but for some reason the same code is working fine with Redis v7.4.0.

@mmclure-msft
Copy link
Contributor

With a version of your code I have this almost working the same as Redis for both redis.status_reply and redis.error_reply - the only difference is:

127.0.0.1:6379> EVAL "return { ok = 'Success' }" 0
Success
127.0.0.1:6379> EVAL "return { err = 'Failure' }" 0
(error) Failure
127.0.0.1:6379> EVAL "return redis.status_reply('Success')" 0
Success
127.0.0.1:6379> EVAL "return redis.error_reply('Failure')" 0
(error) Failure

in Redis vs.

172.24.80.1:6379> EVAL "return { ok = 'Success' }" 0
"Success"
(5.56s)
172.24.80.1:6379> EVAL "return { err = 'Failure' }" 0
(error) Failure
(4.28s)
172.24.80.1:6379> EVAL "return redis.status_reply('Success')" 0
"Success"
(3.62s)
172.24.80.1:6379> EVAL "return redis.error_reply('Failure')" 0
(error) Failure
(4.11s)

in Garnet (ignore the timings - that's just because I had breakpoints enabled). The difference is Success vs. "Success" and I haven't yet found where that's coming from and whether it's a problem or not.

@mmclure-msft
Copy link
Contributor

Added PR #645 to handle this correctly.

@danielchristianschroeter
Copy link
Author

danielchristianschroeter commented Sep 6, 2024

Thank you! This looks good, and the blocking issue appears to be resolved.

However, I've encountered a new undesired behavior: I'm getting a "lock not held" error during the Release, although this doesn't cause any blocking behavior. I previously also set "LuaTransactionMode" to "false."
From what I can tell, the lock seems to be acquired successfully. Even after several attempts, this behavior cannot be reproduced with the Redis v7.4.0.
Unfortunately, I don't see any hint in the logs of Garnet.

Here’s an example code:

func (rs *RedisStorage) Unlock(ctx context.Context, name string) error {

	key := rs.prefixLock(name)

	// load and delete lock from sync.Map
	if syncMapLock, loaded := rs.locks.LoadAndDelete(key); loaded {

		// type assertion for Redis lock
		if lock, ok := syncMapLock.(*redislock.Lock); ok {

			// release the Redis lock
			if err := lock.Release(ctx); err != nil {
				return fmt.Errorf("Unable to release lock for %s: %v", key, err)
			}
		}
	}

	return nil
}

For reference, here’s the related function in redislock:
func (l *Lock) Release(ctx context.Context) error {

@mmclure-msft
Copy link
Contributor

I'm pretty sure the issue is with this line in release.lua:
https://github.com/bsm/redislock/blob/bfd385b776d1e069eca5cb6b6cc41d635cbb4e55/release.lua#L5
because unpack isn't implemented. There's still a lot of work to do to get Garnet up to parity with Redis in terms of LUA scripting. I'll chip away at some of it but it's not going to be a short-term thing.

@mmclure-msft
Copy link
Contributor

Actually, unpack should be implemented:

unpack = unpack;

Will dig a little deeper.

@badrishc
Copy link
Contributor

For obtain.lua, I am hitting an index out of bounds issue with the ipairs command. Are you not hitting this issue?

This seems to be an open issue in NLua, as per this: NLua/NLua#472

@badrishc
Copy link
Contributor

badrishc commented Sep 12, 2024

This unit test works: https://github.com/microsoft/garnet/blob/badrishc/lua-more-fns/test/Garnet.test/LuaScriptTests.cs#L352

But if I replace "a" with "KEYS" which is populated with 1 based offsets, the ipairs fails. This is what obtain.lua does:

local setArgs = {}
for _, key in ipairs(KEYS) do
	table.insert(setArgs, key)
	table.insert(setArgs, ARGV[1])
end

@badrishc
Copy link
Contributor

I worked around the ipairs issue by adding an extra null at the end of the KEYS string array in this PR: #655 so that works, but unpack(KEYS) does not work as it throws a NLua exception: attempt to get length of a System.String[]

@badrishc
Copy link
Contributor

It seems a more robust solution would be to make KEYS and ARGV LuaTable types rather than C# string arrays. That way, all Lua functions on them would be mostly guaranteed to work.

@badrishc
Copy link
Contributor

badrishc commented Sep 18, 2024

The linked PR #672 should support most existing Lua scripts - let me know if you see any issues.

@danielchristianschroeter
Copy link
Author

Thanks!
Unfortunately, I still see the reported issue with "lock not held" from comment #635 (comment)

@badrishc
Copy link
Contributor

Good news - this is now fixed in the latest commit to my linked PR. The example from https://github.com/bsm/redislock?tab=readme-ov-file#examples now works unmodified. If your code still does not work, please provide a stand-alone example similar to the redislock example I linked, so that I can run it directly using go run ..

@danielchristianschroeter
Copy link
Author

@badrishc,
I wanted to follow up and let you know that I've tested the latest commit, and I'm happy to report that it's working very smoothly without any issues now! The improvements you've made to handle Lua scripting in Garnet are fantastic, and I'm confident that many users will appreciate the capability to seamlessly replace Lua Redis implementations with Garnet.

A huge thank you to both you and @mmclure-msft for your tremendous effort and time invested in resolving these issues. It’s wonderful to see the progress that's been made!

Let's close this issue if the PR is merged.

@danielchristianschroeter
Copy link
Author

danielchristianschroeter commented Sep 19, 2024

@badrishc
Args, wait a minute. I tested now also https://github.com/bsm/redislock?tab=readme-ov-file#examples

It seems, this part is not working as expected:

	// Sleep and check the remaining TTL.
	time.Sleep(50 * time.Millisecond)
	if ttl, err := lock.TTL(ctx); err != nil {
		log.Fatalln(err)
	} else if ttl > 0 {
		fmt.Println("Yay, I still have my lock!")
	}

Response with Garnet:

go run .
I have a lock!
Now, my lock has expired!

Response with Redis (Docker image redis:latest):

 go run .
I have a lock!
Yay, I still have my lock!
Now, my lock has expired!

Update:
It looks like Garnet is setting the TTL to zero immediately after obtaining the lock, which is not the expected behavior. This discrepancy could be due to differences in how Garnet handles TTL settings compared to Redis.

Slightly modified main.go with additional logging:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/bsm/redislock"
	"github.com/redis/go-redis/v9"
)

func main() {
	start := time.Now()

	// Connect to redis.
	client := redis.NewClient(&redis.Options{
		Network: "tcp",
		Addr:    "127.0.0.1:6380", // Update Port for Garnet/Redis Endpoint
	})
	defer client.Close()
	fmt.Printf("Redis connection time: %v\n", time.Since(start))

	// Create a new lock client.
	locker := redislock.New(client)
	fmt.Printf("Lock client creation time: %v\n", time.Since(start))

	ctx := context.Background()

	// Try to obtain lock.
	lockStart := time.Now()
	lock, err := locker.Obtain(ctx, "my-key", 300*time.Millisecond, nil) // Increased lock duration
	if err == redislock.ErrNotObtained {
		fmt.Println("Could not obtain lock!")
	} else if err != nil {
		log.Fatalln(err)
	}
	fmt.Printf("Lock obtain time: %v\n", time.Since(lockStart))

	// Don't forget to defer Release.
	defer lock.Release(ctx)
	fmt.Println("I have a lock!")

	// Sleep and check the remaining TTL.
	time.Sleep(50 * time.Millisecond)
	ttlStart := time.Now()
	if ttl, err := lock.TTL(ctx); err != nil {
		log.Fatalln(err)
	} else if ttl > 0 {
		fmt.Printf("Yay, I still have my lock! TTL: %v\n", ttl)
	} else {
		fmt.Println("Lock TTL is zero!")
	}
	fmt.Printf("TTL check time: %v\n", time.Since(ttlStart))

	// Extend my lock.
	refreshStart := time.Now()
	if err := lock.Refresh(ctx, 300*time.Millisecond, nil); err != nil {
		log.Fatalln(err)
	}
	fmt.Printf("Lock refresh time: %v\n", time.Since(refreshStart))

	// Sleep a little longer, then check.
	time.Sleep(100 * time.Millisecond)
	expireCheckStart := time.Now()
	if ttl, err := lock.TTL(ctx); err != nil {
		log.Fatalln(err)
	} else if ttl == 0 {
		fmt.Println("Now, my lock has expired!")
	} else {
		fmt.Printf("Lock still valid! TTL: %v\n", ttl)
	}
	fmt.Printf("Final TTL check time: %v\n", time.Since(expireCheckStart))

	fmt.Printf("Total execution time: %v\n", time.Since(start))
}

With Garnet:

go run .
Redis connection time: 0s
Lock client creation time: 0s
Lock obtain time: 4.149ms
I have a lock!
Lock TTL is zero!
TTL check time: 1.8605ms
Lock refresh time: 2.6271ms
Lock still valid! TTL: 198ms
Final TTL check time: 1.1119ms
Total execution time: 160.9905ms

With Redis:

go run .
Redis connection time: 0s
Lock client creation time: 0s
Lock obtain time: 2.6355ms
I have a lock!
Yay, I still have my lock! TTL: 249ms
TTL check time: 945µs
Lock refresh time: 502.7µs
Lock still valid! TTL: 198ms
Final TTL check time: 1.0737ms
Total execution time: 156.6583ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants