
Processing a log twice will count entries twice using on-disk store (dumb protection) #334

Closed
billietl opened this issue Nov 19, 2015 · 31 comments

Comments

@billietl

When analyzing log files with the on-disk B+Tree feature enabled (--keep-db-files --load-from-disk at each run), parsing a log file twice will process and count each entry twice. It would be good to guard against this.

Possible ways to do this:
- remember which log entries have been analyzed (would take lots of space on disk)
- remember the date of the last log entry (log files would have to be processed in chronological order)
- something else I didn't think of?

@allinurl
Owner

Thanks for pointing this out. Do you actually need to parse the same log twice? To read persisted data only (without parsing new data) you could do:

goaccess --load-from-disk --keep-db-files

@aphorise

@allinurl - can you confirm whether the counts are kept per input file? That is to say, are the counted sets started anew / kept separate for each file in the series of input files?

As a related side note, what do you think of digesting the URI into a similar hash-table lookup, like the proposition currently under development for the UAs? So basically something such as:

echo -n 'http://uri_request...' | sha256sum --tag
# or perhaps slightly shorter
echo -n 'http://uri_request...' | sha1sum --tag

This would serve as a comparative counter for each logged request. The only drawback with this approach is that it intensifies the input process, as SHA methods are arguably not light and would have to be executed against each entry read; it should also be mentioned that this effectively compacts / compresses strings for the in-memory store, irrespective of the original string that was read.

Alternatively, a better approach may be to do this on each input file, which would reasonably suffice for any file length / size (thinking of ranges at or above 2^61 bytes), where the probability of a collision is vanishingly low and may be ignored, e.g.:

if [[ $(sha256sum F1.log) == $(sha256sum F2.log) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi ; 

I think the latter may be an easy and light-enough winner for us to include?

@allinurl
Owner

@aphorise - that's correct, input files go against the data set on disk.

I think I favor the second approach, where it's done on each input file. However, in the case where the input files differ, I'm not sure how you would get the offset line (last parsed line)?

As for a hash-table lookup, like you said, it would be pretty intense to compute the SHA per request. Maybe, as @billietl mentioned, just keep track of the last line parsed (assuming the log is in chronological order)?

@aphorise

@allinurl - It would be great if we could strive to digest whole blocks / chunks of the file. Where file sizes / inputs differ, we could take the smallest one and compare against the SHA output of the same range in the larger ones. So something like:

if [[ $(tail -c 1000 F1.log | sha256sum) == $(tail -c 1000 F2.log | sha256sum) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi ;
# comparing the last 1000 bytes of each file (tail -c 1000)

Of course the files would require some reasonable minimum size, and all of these approaches would only be useful for chunks / files that are exactly the same in both. As you also mentioned, one could take the first and last lines as extra comparisons too, though this would perhaps add too many levels of assumption, since for example the first and last lines of a file could be the only entries replicated from other sources. For this sort of approach a modular, smaller chunk comparison would be ideal.

Take the smallest file, split it into X chunks (the largest possible value) of Y bytes each, then SHA each chunk; then hash the same Y-byte ranges, in the same order, in the next file and compare, to determine whether any chunks can be ignored.
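
A minimal sketch of that block comparison, assuming two hypothetical files F1.log and F2.log and an arbitrary 1 MiB block size (both are placeholders, not a proposed implementation):

BS=1048576                                       # assumed 1 MiB block size
for f in F1.log F2.log ; do
  split -b "$BS" -d "$f" "/tmp/$f.chunk."        # cut the file into fixed-size blocks
  sha256sum /tmp/"$f".chunk.* | awk '{print $1}' > "/tmp/$f.sums"
done
# compare block hashes position by position; matching blocks could be skipped
paste /tmp/F1.log.sums /tmp/F2.log.sums | awk '$1 == $2 { printf "block %d identical -> skip\n", NR }'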

All this is a partial solution, though it would at least reduce exact duplicates that appear in the exact same order.

@allinurl allinurl changed the title [new feature] Processing a log twice will count entries twice (dumb protection) Processing a log twice will count entries twice using on-disk store (dumb protection) Nov 24, 2015
@LeeNX

LeeNX commented Nov 9, 2016

Just ran into this problem by accident: ran a test script twice and the data doubled, so consider this an acknowledgement of the issue. It might be an idea to document this in the on-disk storage section.

As for finding a way to avoid duplicating data on disk: using a log timestamp will not work with piped data. And the idea of hashing/checksumming the complete log line might be overkill, so how about only hashing/checksumming part of the data? Which part, I am not sure: date/time, IP, URL ...
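
A minimal sketch of the "only part of the line" idea, assuming a COMBINED-format log where $1 is the IP, $4 the timestamp and $7 the request path (field positions are an assumption); a checksum of this key could be stored instead of the key itself to save space:

# keep only the first occurrence of each (IP, timestamp, request) triple
awk '!seen[$1 FS $4 FS $7]++' access.log access.log.1 > merged.unique.log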

@skorokithakis

My use case for wanting this is that I want to run GoAccess against my latest logs multiple times, for updates. If runs are not idempotent, and they count visitors twice, it would be very inconvenient.

@skorokithakis

skorokithakis commented Jan 8, 2017

Thinking about this some more, I think that one simple rule will take care of 99% of the problems here. Basically, the rule is:

"Make a note of the latest log line timestamp in the stored dataset and don't import anything before that from the log files in this run."

So, if I ran goaccess on my logfile that ended at 15:02:18 today, and then ran it again a bit later, it would start up, see that the last piece of data in my db is 15:02:18, and any line earlier than or exactly at 15:02:18 in the logfile specified in the commandline would be ignored.

This would prevent double-counting unless the log files specified in this session contained duplicates (e.g. if someone specified a logfile and a copy of the same logfile), which strikes me as much rarer than the case where we just want to parse a logfile that was changed.
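
A rough shell sketch of that rule, assuming a COMBINED-format log, GNU date, and an illustrative state file holding the last seen epoch (all names are assumptions, not how goaccess would implement it):

#!/bin/bash
# rough sketch only: skip anything at or before the stored timestamp
LOG=/var/log/nginx/access.log          # assumed log path
STATE="$HOME/.goaccess.last_ts"        # hypothetical state file holding an epoch
LAST=0
[[ -s $STATE ]] && LAST=$(<"$STATE")

while IFS= read -r line; do
  ts=${line#*[} ; ts=${ts%%]*}         # e.g. 08/Jan/2017:15:02:18 +0000
  ts=${ts//\// } ; ts=${ts/:/ }        # -> 08 Jan 2017 15:02:18 +0000
  epoch=$(date -d "$ts" +%s 2>/dev/null) || continue
  if (( epoch > LAST )); then          # strictly newer than what we already have
    printf '%s\n' "$line"
    NEWEST=$epoch
  fi
done < "$LOG"
[[ -n ${NEWEST:-} ]] && echo "$NEWEST" > "$STATE"

Forking date for every line is slow on large logs, so this only illustrates the bookkeeping; inside goaccess the comparison would happen on the already-parsed timestamp.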

This simple change would make goaccess much, much more convenient for me. As it stands, I can't really use it as my sole analytics software because I can't be sure it won't double-count if I run it on an old logfile.

I'm really bad with C but this is so essential to me that I'd be willing to try writing a PR, if someone could point me in the right direction.

@skorokithakis

By the way, if you're worried about logs being processed out of order (i.e. someone wanting to process the most recent log first), there are two ways it can be addressed:

  1. Goaccess can refuse to process items between its bounds. For example, if you have log lines from Jan 1 to Jan 30, you can refuse to process log items between those dates again. However, this might be an issue with people processing the Jan log, then the May log, then the Feb log, etc.
  2. A more flexible solution would be to just make this an option. Passing --idempotent or --update-only, goaccess will refuse to process log lines between dates it has already seen.

This (together with --process-and-exit) will allow people to set goaccess to update its logs during the day, from the same log, without worrying about duplicate items. We can simply add a cron job a few times a day that will read the latest logfile, and, whenever we want to see statistics, we can open goaccess at any point (possibly parsing the latest logfile again) to see completely up-to-date statistics, going back months or years, without needing to keep or reprocess the logs, or worry about getting duplicate data.
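
For illustration, the kind of crontab entry this would make possible (log path, schedule and log format here are just placeholders):

# hypothetical cron entry: fold new log lines into the persisted dataset every 6 hours
0 */6 * * * goaccess -f /var/log/nginx/access.log --log-format=COMBINED --load-from-disk --keep-db-files --process-and-exit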

This is the use case I think most people want from goaccess, as it's how Google Analytics and other software works. I'm also very excited about being able to finally ditch GA for a non-intrusive (and more accurate) solution.

@allinurl
Owner

allinurl commented Jan 12, 2017

Thanks for your input. Here are my thoughts on this:

1- Add the ability to process multiple logs directly from goaccess. e.g., goaccess -f access.log -f access.log.1. This will give greater flexibility and avoid having to use zcat/cat for multiple files.

2- To process log files (no piped data) incrementally, we keep the inodes of all the files processed (assuming files will stay on the same partition) along with the last line parsed of each file and the timestamp of the last line parsed. e.g., inode:29627417|line:20012|ts:20171231235059

If the inode does not match the current file, parse all lines. If the current file matches the inode, we then read the remaining lines and update the count of lines parsed and the timestamp. Also, as you suggested and as an extra precaution, add a flag that, if passed, makes it skip log lines with a timestamp less than or equal to the one stored.
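
To make (2) concrete, here is a rough shell sketch of that bookkeeping for a single log file, using the inode:|line:|ts: record above (the state-file path and field extraction are assumptions, not the actual implementation):

#!/bin/bash
LOG=/var/log/nginx/access.log                         # assumed log path
STATE="$HOME/.goaccess.state"                         # hypothetical file holding inode:..|line:..|ts:..

inode=$(stat -c %i "$LOG")
total=$(wc -l < "$LOG")
last_ts=$(tail -n 1 "$LOG" | grep -o '\[[^]]*\]')     # bracketed timestamp of the last line

if [[ -s $STATE ]] && [[ $(cut -d'|' -f1 "$STATE") == "inode:$inode" ]]; then
  prev=$(cut -d'|' -f2 "$STATE" | cut -d: -f2)        # lines already parsed
  tail -n +"$((prev + 1))" "$LOG"                     # same file as last run: emit only new lines
else
  cat "$LOG"                                          # unknown/changed inode: emit everything
fi
printf 'inode:%s|line:%s|ts:%s\n' "$inode" "$total" "$last_ts" > "$STATE"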

3- Piped data is a bigger problem. It will have to be based on the flag mentioned above and the timestamp of the last line read. For instance, it will parse and discard all incoming entries until it finds a timestamp greater than the one stored (should it be inclusive?).

The issue here is that you could have multiple consecutive lines with the same timestamp (even at the second level), so a few may end up as duplicates. However, I assume that in most cases, for incremental log parsing, people would be parsing the logs as they are without pre-processing them and thus #1 would come in handy.

Unless I'm missing something, I think this would be a reasonable implementation that should address most of the problems of duplicate data for incremental parsing.

Thoughts?

@skorokithakis

I completely second the above. I like the inode optimization, although I think it's important that the timestamp-aware exclusion be added, as sometimes logrotate might copy files and the inode will change, so duplicates should be avoided in that case.

I also think the timestamp comparison needs to be exclusive of the last time, as I would rather miss a few lines than double-count a few, but I don't feel very strongly about this.

Keep in mind that, if you accept multiple files in the command line, people will expect that they will all be live-updated. This might not be an issue, I'm just pointing it out.

Overall, this feature sounds great to me. I want to write a post about the inaccuracy of analytics and how GoAccess/Piwik can solve many of the problems at once, but I'd like GoAccess to be a bit more usable for the "set it and forget it" crowd before I recommend it. This feature would get me all the way there, and I could then show people how to set up GoAccess so they always get live and historical data at once.

Thank you!

@allinurl
Owner

@skorokithakis It looks like unless copytruncate is used, logrotate just renames the file. If it just moves the file within the same file system, the rename is only a metadata change, so it should preserve the inode number and work as expected :)
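
A quick way to confirm that on a given box (the temporary file names are just for illustration):

# a rename within the same filesystem is only a metadata change, so the inode is preserved
touch /tmp/demo.log
stat -c %i /tmp/demo.log         # note the inode number
mv /tmp/demo.log /tmp/demo.log.1
stat -c %i /tmp/demo.log.1       # prints the same inode number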

I'll look into this and post back. Thanks again and stay tuned!

@skorokithakis

Is there any update on this? I'd really like to write that post on using GoAccess for analytics, but without this feature the UX is too hard...

@allinurl
Owner

@skorokithakis From my previous comment, the first part (1) is done, and will be deployed soon in the upcoming version. 2 and 3 will probably come in the release after. I want to leave v1.1 as stable as possible before moving into 1.2. Stay tuned :)

@skorokithakis

Ah, alright, thank you!

@SeLLeRoNe

I'm interested in this too :)
I'm just replying to receive updates.

Thanks

@chennin

chennin commented Jun 10, 2017

I just installed 1.2-1 from the deb.goaccess.io repo and am hitting this bug. skorokithakis' use case in the Jan 9 comment is my use case:

  • Parse existing logs once (some compressed) to HTML
  • Update the HTML every X hours
  • Not live updates

What's the current workaround for this? Hack up a rotate & parsing schedule?

@allinurl
Owner

allinurl commented Jun 10, 2017

@chennin I'm working on this on a local branch while keeping up with some bug reports. However, a workaround can be easily scripted in bash, something like this should help for now:

#!/bin/bash

# change the following two...
LOG=/var/log/nginx/access.log
LASTREAD=/home/user/.goaccess.last

# total number of lines currently in the log
OFFSET=$(sed -n '$=' "$LOG")
START=1
# resume right after the last line number recorded on the previous run
if [[ -s $LASTREAD ]]; then
  START=$(<"$LASTREAD")
  START=$((START + 1))
fi

# print only the lines that have not been processed yet,
# then remember how far we got for the next run
sed -n "${START},${OFFSET}p" < "$LOG"
echo "$OFFSET" > "$LASTREAD"

then you can simply run it as:

./myscript.sh | goaccess - --load-from-disk --keep-db-files --log-format=COMBINED

The script will basically read from the first unprocessed line of your log through to the end of it, then save the last line number read into the .goaccess.last file, which will be used as the starting point for the next run. Feel free to modify it to suit your env :)

@FedericoCeratto

Perhaps goaccess could be made smart enough to understand that all the files named foo.log[.<number>[.gz]] belong to the same source (instead of passing -f multiple times) and scan older files only if needed, based on their mtime.

However, the whole point is to feed goaccess old data on its first run. Maybe this can be done externally with a script like #926
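
Until goaccess can do that itself, feeding the rotated set in mtime order for that first run can be scripted externally, e.g. (paths are assumptions):

# oldest first by mtime, decompressing rotated .gz files on the fly,
# to seed the on-disk store with historical data in one go
ls -tr /var/log/nginx/access.log* | while read -r f; do
  case "$f" in
    *.gz) zcat "$f" ;;
    *)    cat "$f" ;;
  esac
done | goaccess - --log-format=COMBINED --keep-db-files -o report.html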

Related #925

@allinurl
Owner

allinurl commented May 9, 2018

@freemp This is awesome, thanks for sharing it!

allinurl added a commit that referenced this issue May 10, 2020
…ring data.

See #334 for double counting
protecting on persisted data.
@allinurl
Owner

This has been added to v1.4 (implemented as stated here) - finally. This release is around the corner, so stay tuned next week. Thanks again for all the feedback.
