Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor #19

Merged
merged 4 commits into from
Apr 9, 2014
Merged

refactor #19

merged 4 commits into from
Apr 9, 2014

Conversation

mreiferson
Copy link
Member

this does essentially the same structural changes as pynsq in nsqio/pynsq#52

@mreiferson
Copy link
Member Author

work in progress... just pushing up what I had laying around locally...

@mreiferson
Copy link
Member Author

@kjk @shurcooL I've had this WIP refactoring PR open for a while (I've just rebased this on the deadlock fixes in #29)

If you're interested/willing to help test/verify this beta code in high volume/concurrency situations, that would be helpful!

I'd like to more aggressively work towards getting this tested and merged as I have some other public API refactoring that I think is long overdue (better sooner than later) cc @jehiah

@kjk
Copy link

kjk commented Apr 3, 2014

Currently the refactor_19 branch doesn't compile: ../../../bitly/go-nsq/reader.go:668: undefined: nsqConn

@mreiferson
Copy link
Member Author

fixed (posting from the right account this time)

Also, in a separate issue I intend to document my proposal for improving the public facing API

@kjk
Copy link

kjk commented Apr 3, 2014

There's very verbose logging turned on (which seems to include hexdump of messages) compared to master (without me changing any flags), which makes debugging under heavy-load impractical.

A sample verbose log message:

EVENT: MessageCB &{Id:[48 54 52 53 52 48 55 57 49 101 54 53 56 48 48 48] Body:[67 105 81 120 89 109 69 121 78 109 69 119 77 67 49 109 79 71 78 106 76 84 69 120 90 71 85 116 89 106 73 119 89 83 48 119 77 68 69 50 77 50 85 119 77 50 74 105 78 71 73 97 100 103 111 67 90 107 89 83 99 72 115 103 73 105 82 122 90 88 81 105 73 68 111 103 101 121 65 105 90 107 89 117 100 72 66 49 97 87 81 105 73 68 111 103 73 106 90 106 89 84 77 53 77 84 78 106 79 68 77 121 90 84 74 109 77 87 85 119 77 68 65 51 78 87 78 104 77 50 77 121 78 84 86 108 79 68 85 119 79 84 65 48 77 84 103 119 78 87 73 52 78 106 90 106 79 71 86 105 89 109 77 53 89 122 103 52 89 109 86 107 89 106 69 121 77 109 69 52 89 122 69 52 77 109 90 106 78 68 89 120 79 84 65 121 90 71 73 52 77 84 90 104 73 105 66 57 73 72 48 97 75 81 111 67 90 107 89 83 73 51 115 105 74 72 78 108 100 67 73 103 79 105 66 55 73 108 57 48 76 110 89 117 90 107 89 105 73 68 111 103 77 84 77 53 78 106 85 50 77 84 89 120 78 110 49 57] Timestamp:1396561616349113725 Attempts:1 exitChan:0xc2118dbd20 cmdChan:0xc2118dbea0 responseChan:0xc2118dbd80}

It generated 4.3G of logs in few minutes.

@mreiferson
Copy link
Member Author

whooops, sorry about that, I'll remove that debug logging

@mreiferson
Copy link
Member Author

@kjk quieted the logging, let me know if the Stop() method is still broken

@kjk
Copy link

kjk commented Apr 3, 2014

@mreiferson actually, Stop() wasn't broken, I was confused

@kjk
Copy link

kjk commented Apr 3, 2014

It seems to be running fine against very heavy nsq traffic.

It's not a deal breaker as it does work but looking at logs the shutdown sequence (i.e. calling r.Stop()) doesn't look "clean". I don't know if this is a change compared to master.

I'm using my test program from https://gist.github.com/kjk/41cd49704c635b35bfd1. I use ./test -workers=6,4 & and then do kill $pid to initiate shutdown and then wait on ExitChan for all of them to shutdown. And they do shutdown, but I wouldn't expect the error conditions like "use of closed network connection" to happen.

Full shutdown log is at https://gist.github.com/kjk/4401317d23f89f4ab3f3

A representitive sample:

2014/04/03 23:20:23 [server:4150/user_updates/test_many_nsq_readers] heartbeat received
2014/04/03 23:20:23 [server:4150/user_updates/test_many_nsq_readers] heartbeat received
2014/04/03 23:20:30 stopping reader
2014/04/03 23:20:30 stopping reader
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555c8201f2000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555ddff561000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] IO Error - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555e0e0ee0000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555e0fa258000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555de08d61000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555dd11a34000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555e215f0b000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555e33c2ea000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] error finishing 064555c8349f2000 - use of closed network connection
2014/04/03 23:20:30 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop

@mreiferson
Copy link
Member Author

@kjk last commit should resolve those Stop issues

@kjk
Copy link

kjk commented Apr 4, 2014

I'm still seeing e.g. "use of closed network connection" errors in the log:

2014/04/04 00:21:14 stopping reader
2014/04/04 00:21:14 stopping reader
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] IO Error - read tcp 10.0.0.126:4150: use of closed network connection
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] breaking out of writeLoop
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] received ACK from nsqd - now in CLOSE_WAIT
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] writeLoop exiting
2014/04/04 00:21:14 [server:4150/user_updates/test_many_nsq_readers] beginning stopWriteLoop

@mreiferson
Copy link
Member Author

Hmmm, it's no longer erroring on closed connection while trying to FIN messages though (good).

I will continue to poke around, thanks.

@mreiferson
Copy link
Member Author

@kjk I fixed a few more Stop related issues, would you mind giving this one more spin?

I should really write tests 😈

@mreiferson
Copy link
Member Author

RFR @jphines @ploxiln @jehiah

This diff is a mess, good luck, I'm sorry. I tried to keep these commits reasonably separate...

@kjk
Copy link

kjk commented Apr 4, 2014

This one does a clean shutdown - I don't see unexpected errors in the log.

@mreiferson
Copy link
Member Author

@kjk great! we'll clean this up and get it merged in, it's a much better foundation for the next round of changes I'm thinking of.

thanks for all your help!

@jehiah
Copy link
Member

jehiah commented Apr 4, 2014

I've skimmed this, but if possible would love @jphines or @ploxiln to take a closer look.

conn.OutputBufferTimeout = q.OutputBufferTimeout
}
conn.MessageCB = func(c *NSQConn, msg *Message) {
// log.Printf("[%s] EVENT: MessageCB %+v", c, msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented out debug statements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I'll drop those

@jphines
Copy link
Contributor

jphines commented Apr 4, 2014

@mreiferson I'll put this out in some places on prod over the weekend, see if anything pops up.

@mreiferson
Copy link
Member Author

@jphines what's the deal

case string:
return strconv.ParseBool(v.(string))
case int, int16, uint16, int32, uint32, int64, uint64:
return reflect.ValueOf(value).Int() == 0, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might want != 0 for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, good catch (that's currently broken in master)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@jphines
Copy link
Contributor

jphines commented Apr 7, 2014

was poking around here.
ran tcp kill for one host

2014/04/07 18:16:24 [nsqrelay17.ec2.bitly.net:4150/api_requests/stream_metrics-stage] heartbeat received
2014/04/07 18:16:24 [nsqrelay09.ec2.bitly.net:4150/newman_events/stream_metrics-stage] heartbeat received
2014/04/07 18:16:24 [nsqrelay09.ec2.bitly.net:4150/newman_events/stream_metrics-stage] IO Error - read tcp 10.215.147.147:4150: connection reset by peer
2014/04/07 18:16:24 [nsqrelay09.ec2.bitly.net:4150/newman_events/stream_metrics-stage] readLoop exiting
2014/04/07 18:16:24 [nsqrelay09.ec2.bitly.net:4150] connecting to nsqd
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x42f522]

goroutine 175 [running]:
bitly/go-nsq._func_015(0xf8406d7fb8, 0x2aaaaac13d80, 0x2aaaaac13b88, 0x429f68, 0xf84026d000, ...)
    /tmp/go-install-as_t0Jh-Hri/src/bitly/go-nsq/reader.go:460 +0x9f
bitly/go-nsq.(*Reader).ConnectToNSQ(0xf8404181c0, 0xf8405560e0, 0xf80000001d, 0xf800000004, 0xf8405560e0, ...)
    /tmp/go-install-as_t0Jh-Hri/src/bitly/go-nsq/reader.go:467 +0x824
bitly/go-nsq.(*Reader).queryLookupd(0xf8404181c0, 0xf840419140)
    /tmp/go-install-as_t0Jh-Hri/src/bitly/go-nsq/reader.go:366 +0x671
bitly/go-nsq.(*Reader).lookupdLoop(0xf8404181c0, 0x0)
    /tmp/go-install-as_t0Jh-Hri/src/bitly/go-nsq/reader.go:306 +0x17e
created by bitly/go-nsq.(*Reader).ConnectToLookupd
    /tmp/go-install-as_t0Jh-Hri/src/bitly/go-nsq/reader.go:282 +0x23d

... and it barfed. hung up the entire reader.

@jphines
Copy link
Contributor

jphines commented Apr 7, 2014

the host was also an nsqd instances AND lookupd fwiw

@jphines
Copy link
Contributor

jphines commented Apr 8, 2014

hmm. looks like this locked up again. tracing log to see if I can find relevant information.

@jphines
Copy link
Contributor

jphines commented Apr 8, 2014

screen shot 2014-04-08 at 11 31 49 am
screen shot 2014-04-08 at 11 31 55 am

@mreiferson
Copy link
Member Author

@jphines @ploxiln see the last few commits which I believe resolve the connection close related deadlocks.

I've verified this running nsq_tail, built against this PR, connecting to a handful of producer nodes while using tcpkill to force closes.

Please re-build and deploy 😈

* make Conn public
* move transport responsibilities into Conn from Reader/Writer
* supply callbacks for hooking into Conn events
* name anon cleanup func; add to wg
* ensure that readLoop exits before exiting cleanup
* always check messagesInFlight at readLoop exit
* close underlying connection last
@jphines
Copy link
Contributor

jphines commented Apr 9, 2014

This is holding up.

jphines pushed a commit that referenced this pull request Apr 9, 2014
@jphines jphines merged commit 8d9370f into nsqio:master Apr 9, 2014
@mreiferson mreiferson deleted the refactor_19 branch April 9, 2014 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants