Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

playlist entry hang on 1/7/2020 on production at 2:12 pm #123

Closed
eric-gilbertson opened this issue Jan 11, 2020 · 23 comments
Closed

playlist entry hang on 1/7/2020 on production at 2:12 pm #123

eric-gilbertson opened this issue Jan 11, 2020 · 23 comments
Assignees
Labels
bug Something isn't working

Comments

@eric-gilbertson
Copy link
Collaborator

lois reports that playlist entry stopped working during her show. it appeared around 2:12 when she tried to perform a manual music entry. the subsequent entries were made from another PC after she complete her show.

https://zookeeper.stanford.edu/?action=viewDate&seq=selList&playlist=40105&session=

@RocketMan
Copy link
Owner

No changes were made at that time to prod or stage, so I suspect a glitch on the control A machine or the network, as she was able to access zk from another machine later.

@eric-gilbertson
Copy link
Collaborator Author

did you check the server log for any errors around this time?

@RocketMan
Copy link
Owner

There is nothing unusual in the error log, and requests continued to come in from the period of 2:12 until 2:30 with no gaps. According to the access log, a total of 2,483 requests were processed during that period, so zk was up and responding.

Given she should access it from another machine later, it suggests a transient error, either on the system in control A or the network.

@eric-gilbertson
Copy link
Collaborator Author

eric-gilbertson commented Jan 19, 2020

Reopening the issue because it happened to me at approximately 14:30 on 01/18/2020. Problem appears to be that the add click invokes a 'move down' operation rather than add which in turn resulted in an error response from the server and refreshing the page did not fix the issue. rough steps (on firefox) were:

  • create playlist & enter tracks
  • enter a tag ID
  • quickly click tab, select a track, click tab, click space - page is now inoperative
  • open debugger - see that there is a JSON response error and the the action was seqDown (or something like that).

Note it was difficult to debug further due to the javascript minification. suggest disabling it since the JS payload is small.

@RocketMan RocketMan added the bug Something isn't working label Jan 19, 2020
@RocketMan
Copy link
Owner

RocketMan commented Jan 19, 2020

Oooh. I wonder if the 'seq' you saw during the insert is related to the new code I added to order spins correctly in a live playlist that has 'future' entries? There is a variable seq involved in that.

Your change to prevent setting future times in a live list obviates this case, so we could just turn it off, but it should work correctly nonetheless, so I want to understand what is going on and deal with root cause.

Can you recall the exact js error message? That would be extremely helpful.

Also, add track does not use the legacy track up/down for its sequencing, so clicking add should not in any way involve a 'move down'. That's why the exact error message is crucial.

Note it was difficult to debug further due to the javascript minification. suggest disabling it since the JS payload is small.

There are only a dozen or so lines of code in this particular case, so it should be easy enough just to figure it out from the message alone.

For some of the more complex inline code, maybe we should look at refactoring it to external js files. I have a plan to add automatic map file generation for our external js, so it can be debugged with ease.

resulted in an error response from the server and refreshing the page did not fix the issue.

This is very interesting. What happens when you reload? I've never encountered a situation where javascript could so badly mess up the browser that a reload would not clear it. Have you ever encountered something like that, where the offending code is not run upon page reload, yet reload does not clear the problem?

Seems like carefully crafted javascript that can elicit this behaviour would be a nice exploit.

[Edit: added text, please re-read]

@RocketMan
Copy link
Owner

it happened to me at approximately 16:00 on 01/18/2020.

Can you please confirm the time? I was going to review the log, but I see your playlist ends at 1530 and the next playlist commences at 1800.

@RocketMan RocketMan self-assigned this Jan 19, 2020
@RocketMan RocketMan added this to the v2.6.1 milestone Jan 19, 2020
@eric-gilbertson
Copy link
Collaborator Author

time was incorrect, the correct time was approximately 14:30. can you obtain the creation time of the second (empty) playlist because it was most likely created as part of my effort to correct the issue.? fwiw, i've tried introducing json error returns to the two REST requests involved in the bug flow and i could not get the page to stall.

@RocketMan
Copy link
Owner

I'm not seeing anything interesting in the logs. There is an unrelated warning that pollutes the error log that ideally we will clear; I'll open a separate issue to track that.

We cannot tell when an insert occurred, but I thought I would be clever and review the access log and try to figure it out that way. Unfortunately, new playlist is a form post, so it is difficult to pick it out definitively in the access log. In the access log, I did find you had three sessions:

  1. 1205 - login (session=41a3bde65341419a336a411f8fc6e404)
    1219 - last add track
  2. 1336 - login (session=89f0ccc17f58af72a9b983598f711eae)
    1411 - last add track
  3. 1415 - login (session=0ecb17364762e566f6afb843afd28837)
    1429 - delete track (*see note)
    1433 - delete track (*see note)
    1454 - down track x 2 (**see note)
    1609 - logout (session=0ecb17364762e566f6afb843afd28837)

*Interesting were two delete tracks, at 1429 and at 1433, both of exactly the same track id (that is to say, multiple attempts to delete the same track.) That particular track was successfully deleted, so we cannot say what it was, but it originally sat between Dylan/Jokerman and Rucker/Straight to Hell. Not sure if delete track is implicated in any way, but it does occur around 1430.

**There were two distinct seq=downTrack events at 1454... all of the subsequent addTrack and getTracks ajax calls after that until 1537 include this URL in their Referer, as it was the last page load. A little out of range of 1430, but such a Referer could well appear as part of an error message, though not directly related to it.

I've been attempting to reproduce both in my dev env and in prod, using tabs and space, as well as various combinations adds, time edits, deletes, etc, but as yet, unable to reproduce.

@eric-gilbertson
Copy link
Collaborator Author

another point to keep in mind is that logging out/in cleared the issue so it would appear that it is somehow related to the user's session state.

@RocketMan
Copy link
Owner

I noticed something very interesting in the log. The following extract consists of log entries from the control A mac during your show from 1410-1415:

  1. 171.66.118.58 - - [18/Jan/2020:14:10:49 -0800] "POST /?action=addTrack&oaction=editListEditor HTTP/1.1" 200 4544 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  2. 171.66.118.58 - - [18/Jan/2020:14:10:50 -0800] "POST /?action=addTrack&oaction=editListEditor HTTP/1.1" 200 4391 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  3. 171.66.118.58 - - [18/Jan/2020:14:10:50 -0800] "POST /?action=addTrack&oaction=editListEditor HTTP/1.1" 200 4391 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  4. 171.66.118.58 - - [18/Jan/2020:14:11:46 -0800] "POST /?action=addTrack&oaction=editListEditor HTTP/1.1" 200 1065 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"
  5. 171.66.118.58 - - [18/Jan/2020:14:15:03 -0800] "GET / HTTP/1.1" 200 16295 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  6. 171.66.118.58 - - [18/Jan/2020:14:15:06 -0800] "GET /?session=&action=search HTTP/1.1" 200 6857 "https://zookeeper.stanford.edu/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  7. 171.66.118.58 - - [18/Jan/2020:14:15:18 -0800] "POST /? HTTP/1.1" 200 9523 "https://zookeeper.stanford.edu/?session=&action=search" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  8. 171.66.118.58 - - [18/Jan/2020:14:15:21 -0800] "GET /?s=byAlbumKey&n=903855&q=50&action=search&session= HTTP/1.1" 200 8191 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"
  9. 171.66.118.58 - - [18/Jan/2020:14:15:24 -0800] "GET /ssoLogin.php HTTP/1.1" 307 579 "https://zookeeper.stanford.edu/?s=byAlbumKey&n=903855&q=50&action=search&session=" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:72.0) Gecko/20100101 Firefox/72.0"

Lines 1-4 are ajax addTrack requests. Lines 1-3 are from Firefox, while line 4 is a request from Safari!

It turns out you have the playlist editor open in two browsers concurrently!!

Referring back to the forensics in the previous comment, I can confirm your login at 1205 is from Firefox while the next one at 1336 is from Safari. You had two concurrent, active sessions open in two different browsers, and you then attempt to edit the same live playlist via both sessions.

It seems things went bad immediately after the Safari addTrack (line 4), as we see a new login from Firefox (line 9) shortly thereafter.

Being logged in from two browsers is fine, but working on the same playlist at the same time from two browsers is undefined. As you know, the playlist editor does not sync the playlist on each addTrack, it just uses the local DOM, which may not reflect the actual state of the playlist, if that playlist is being modified concurrently by another browser.

@RocketMan
Copy link
Owner

RocketMan commented Jan 22, 2020

Lois's failure mode is different to yours. She was using Chrome Safari only. It seems her browser became unresponsive, as it got stuck in a loop sending addTrack requests: There were 58 addTrack requests sent in rapid succession from 1416-1417 during her show on 7 Jan.

There are two problems here, 1. addTrack failure on the service, and 2. the js code should just report errors and fail, not get stuck in a loop like this.

I am continuing to investigate. Any insights will be appreciated.

[Edit: Lois's UA string is for Chrome, not Safari.]

@RocketMan
Copy link
Owner

The log for Lois looks like this:

171.66.118.58 - - [07/Jan/2020:14:16:24 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 10851 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:27 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:31 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:31 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:31 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:32 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:32 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:32 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
171.66.118.58 - - [07/Jan/2020:14:16:32 -0800] "POST /?action=addTrack&oaction=newListEditor HTTP/1.1" 200 4312 "https://zookeeper.stanford.edu/?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"

...and it continues in this way a couple dozen more times with rapid-fire addTrack requests.

Unfortunately, we don't know what was in the request. Apparently the server did not like what it saw (as no track was added) and the browser did not like what it received (as it kept on trying). No errors logged on the server.

@eric-gilbertson
Copy link
Collaborator Author

just to clarify, i did not invoke safari until after firefox stalled. i will continue to look out for the error. again, i'd suggest disabling javascript minification so that the issue can be debugged should it happen again.

@RocketMan
Copy link
Owner

just to clarify, i did not invoke safari until after firefox stalled.

In that case, maybe your Firefox stalled much earlier than 1430?

The logs show that you logged in from Safari at 1336, opened Edit Playlist, and did a couple add tracks (1336, 1339). Then you go back to Firefox addTrack at 1406, Safari addTrack at 1407, Firefox addTrack at 1410, Safari addTrack at 1411. So there is definitely concurrent access to the playlist over a number of minutes from both browsers. We cannot discount that the error you saw is related to this.

FWIW, there was no activity from either browser between 1257 and 1336. The last addTrack from Firefox before that was at 1219.

In any event, please do report the error if you see it again.

i'd suggest disabling javascript minification so that the issue can be debugged should it happen again.

OK, it is now disabled in Playlists.php in prod. jQuery will still be minified with no source map; you will have to deal with that.

@RocketMan
Copy link
Owner

Server-side error logging has been deployed to production to capture errors on track insert and update.

@RocketMan
Copy link
Owner

Access logs from 2020-01-07 and 2020-01-18

studio-a.log.20200107.txt
studio-a.log.20200118.txt

@RocketMan
Copy link
Owner

RocketMan commented Jan 26, 2020

Per Eric: ZK hang at studio approximately 1:52 pm on Sat 25 Jan

Access logs from 2020-01-25

studio-a.log.20200125.txt
error.log.20200125.txt

@RocketMan
Copy link
Owner

Eric, it seems you are able to reproduce this almost every show... can you reproduce it on a machine that is not the Control A mac?

In the most recent logs, you did a delete shortly before the time in question... there were some deletes just before the time of the issue on the 18th as well... coincidence?

I have been trying, but still unable to reproduce.

@eric-gilbertson
Copy link
Collaborator Author

jim, this has occurred to me only once on 01/18/2020. i, (as opposed to mike) had no problem with my show on 01/25/2020. i will continue to keep watch for it though. just a hunch, but i vaguely recall quickly tabbing from the tag ID -> track select -> Add fields when my one exception occurred.

@RocketMan
Copy link
Owner

RocketMan commented Jan 27, 2020

I believe I have found it. Offending code was the hidden track-session in the playlist editor form. Note the anomalous value attribute (diff is original to fixed):

index 3411f5c..9a03127 100644
--- a/ui/Playlists.php
+++ b/ui/Playlists.php
@@ -828,7 +828,7 @@ class Playlists extends MenuItem {
 
     ?>
         <div class='pl-form-entry'>
-            <input id='track-session' type='hidden' value='session VALUE="<?php echo $this->session->getSessionID(); ?>'>
+            <input id='track-session' type='hidden' value='<?php echo $this->session->getSessionID(); ?>'>
             <input id='track-playlist' type='hidden' value='<?php echo $playlistId; ?>'>
             <label></label><span id='error-msg' class='error'></span>
             <div>

track-session is used to populate the session parameter in the addTrack request. Normally, this parameter is not used PROVIDED THAT the browser is also supplying the session cookie.

Problem manifests when there is no session cookie. In this case, zk looks to the form value, which in this case is not the session ID as expected, so the request fails to authenticate and is not processed.

Not sure how I lost my session cookie, but this is how I discovered it. Failure mode then matched exactly that described in this slip.

Fix is now in prod. Let's see how this goes...

[Edit: In addition, I have just opened #129 to revisit session cookie scope, to deal with the disappearing cookie aspect of this issue. Even without that, however, the issue should be resolved, as addTrack is now providing the expected session ID.]

@RocketMan
Copy link
Owner

RocketMan commented Jan 31, 2020

This should already be resolved, but am keeping open for now and moving to the latest hotfix milestone.

@RocketMan RocketMan modified the milestones: v2.6.1, v2.7.1 Jan 31, 2020
@RocketMan
Copy link
Owner

Keeping open on the latest hotfix milestone

@RocketMan RocketMan modified the milestones: v2.7.1, v2.7.2 Feb 2, 2020
@RocketMan RocketMan modified the milestones: v2.7.2, v2.8.1 Feb 17, 2020
@RocketMan RocketMan removed this from the v2.8.1 milestone May 8, 2020
@RocketMan
Copy link
Owner

The problem is that stage and prod were pointed to two different databases, however cookies are shared between all instances on the same server. When a browser window was opened to stage or prod without the session ID (such as could happen simply because of a new window), the cookie would be used. If the cookie was created in the other instance (stage or prod), the session ID was invalid and the cookie deleted. This means any other windows needing the cookie (for example, because the session ID was not supplied in the request, such as the broken track add above), then it would fail.

So the immediate cause -- the broken session ID in track add -- has been fixed. The root cause is that the session cookie breaks when stage and prod point to different databases. We won't do that again until we can scope session cookies to each unique instance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants