
Processing a log twice will count entries twice using on-disk store (dumb protection) #334

Closed
billietl opened this issue Nov 19, 2015 · 31 comments

Comments

@billietl

When analyzing log files with the on-disk B+Tree feature enabled (--keep-db-files --load-from-disk at each run), parsing a log file twice will process and count each entry twice. It would be good to guard against this.

Possible ways to do this:
- remember which log entries have been analyzed (would take lots of space on disk)
- remember the date of the last log entry (log files would have to be processed in chronological order)
- something else I didn't think of?

@allinurl
Owner

Thanks for pointing this out. Do you actually need to parse the same log twice? To read persisted data only (without parsing new data) you could do:

goaccess --load-from-disk --keep-db-files

@aphorise

@allinurl - can you confirm whether the counts are kept per input file? That is to say, are the counted sets started anew / kept separate for each file in the series of input files?

As a related side note, what do you think of digesting the URI into a similar hash-table lookup, like the proposition currently under development for the UAs? So basically something such as:

echo -n 'http://uri_request...' | sha256sum --tag
# or perhaps slightly shorter
echo -n 'http://uri_request...' | sha1sum --tag

This would serve as a comparative counter for each logged request. The only drawback with this approach is that it intensifies the input process, as SHA methods are arguably not light and would have to be executed against each entry read; it should also be mentioned that this effectively compacts / compresses strings for the in-memory store, irrespective of the original string that was read.

Alternatively, a better approach may be to do this on each input file, which would reasonably suffice for any file length / size (thinking of ranges at or above 2^61 bytes), where the probability of a collision is vanishingly low and may be ignored, e.g.:

if [[ $(sha256sum F1.log) == $(sha256sum F2.log) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi ; 

I think the latter may be an easy and light-enough winner for us to include?

@allinurl
Owner

@aphorise - that's correct, input files go against the data set on disk.

I think I favor the second approach, where it's done on each input file. However, in the case where the input files differ, I'm not sure how you would get the offset line (last parsed line)?

As for a hash-table lookup, like you said, it would be pretty intense to compute the SHA per request. Maybe, as @billietl mentioned, just keep track of the last line parsed (assuming the log is in chronological order)?

@aphorise

@allinurl - It would be great if we could strive to digest whole blocks / chunks of the file. Where file sizes / inputs differ, we could take the smallest one and compare against the SHA output of the same range in the larger ones. So something like:

if [[ $(tail -c 1000 F1.log | sha256sum) == $(tail -c 1000 F2.log | sha256sum) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi ;
# comparing the last 1000 bytes of each file (tail -c 1000)

Of course the files would require some reasonable minimum size, and all of these approaches would only be useful for chunks / files that are exactly the same in both. As you also mentioned, one could take the first and last lines as extra comparisons too, though this would perhaps add too many levels of assumption, since for example the first and last lines of a file could be the only entries replicated from other sources. For this sort of approach a modular, smaller chunk comparison would be ideal.

Take the smallest file, split it into X chunks (the largest possible value) of Y bytes each, then SHA each chunk; then hash the same Y-byte ranges, in the same order, in the next file and compare, to determine whether any chunks can be ignored.
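
A minimal sketch of that block comparison, assuming two hypothetical files F1.log and F2.log and an arbitrary 1 MiB block size (both are placeholders, not a proposed implementation):

BS=1048576                                       # assumed 1 MiB block size
for f in F1.log F2.log ; do
  split -b "$BS" -d "$f" "/tmp/$f.chunk."        # cut the file into fixed-size blocks
  sha256sum /tmp/"$f".chunk.* | awk '{print $1}' > "/tmp/$f.sums"
done
# compare block hashes position by position; matching blocks could be skipped
paste /tmp/F1.log.sums /tmp/F2.log.sums | awk '$1 == $2 { printf "block %d identical -> skip\n", NR }'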

All this is a partial solution, though it would at least reduce exact duplicates that appear in the exact same order.

@allinurl allinurl changed the title [new feature] Processing a log twice will count entries twice (dumb protection) Processing a log twice will count entries twice using on-disk store (dumb protection) Nov 24, 2015
@LeeNX

LeeNX commented Nov 9, 2016

Just ran into this problem by accident: ran a test script twice and the data doubled, so consider this an acknowledgement of the issue. It might be an idea to document this in the on-disk storage section.

As for finding a way to avoid duplicating data on disk: using a log timestamp will not work with piped data. And the idea of hashing/checksumming the complete log line might be overkill, so how about only hashing/checksumming part of the data? Which part, I am not sure: date/time, IP, URL ...
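
A minimal sketch of the "only part of the line" idea, assuming a COMBINED-format log where $1 is the IP, $4 the timestamp and $7 the request path (field positions are an assumption); a checksum of this key could be stored instead of the key itself to save space:

# keep only the first occurrence of each (IP, timestamp, request) triple
awk '!seen[$1 FS $4 FS $7]++' access.log access.log.1 > merged.unique.log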

@skorokithakis

My use case for wanting this is that I want to run GoAccess against my latest logs multiple times, for updates. If runs are not idempotent, and they count visitors twice, it would be very inconvenient.

@skorokithakis

skorokithakis commented Jan 8, 2017

Thinking about this some more, I think that one simple rule will take care of 99% of the problems here. Basically, the rule is:

"Make a note of the latest log line timestamp in the stored dataset and don't import anything before that from the log files in this run."

So, if I ran goaccess on my logfile that ended at 15:02:18 today, and then ran it again a bit later, it would start up, see that the last piece of data in my db is 15:02:18, and any line earlier than or exactly at 15:02:18 in the logfile specified in the commandline would be ignored.

This would prevent double-counting unless the log files specified in this session contained duplicates (e.g. if someone specified a logfile and a copy of the same logfile), which strikes me as much rarer than the case where we just want to parse a logfile that was changed.
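
A rough shell sketch of that rule, assuming a COMBINED-format log, GNU date, and an illustrative state file holding the last seen epoch (all names are assumptions, not how goaccess would implement it):

#!/bin/bash
# rough sketch only: skip anything at or before the stored timestamp
LOG=/var/log/nginx/access.log          # assumed log path
STATE="$HOME/.goaccess.last_ts"        # hypothetical state file holding an epoch
LAST=0
[[ -s $STATE ]] && LAST=$(<"$STATE")

while IFS= read -r line; do
  ts=${line#*[} ; ts=${ts%%]*}         # e.g. 08/Jan/2017:15:02:18 +0000
  ts=${ts//\// } ; ts=${ts/:/ }        # -> 08 Jan 2017 15:02:18 +0000
  epoch=$(date -d "$ts" +%s 2>/dev/null) || continue
  if (( epoch > LAST )); then          # strictly newer than what we already have
    printf '%s\n' "$line"
    NEWEST=$epoch
  fi
done < "$LOG"
[[ -n ${NEWEST:-} ]] && echo "$NEWEST" > "$STATE"

Forking date for every line is slow on large logs, so this only illustrates the bookkeeping; inside goaccess the comparison would happen on the already-parsed timestamp.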

This simple change would make goaccess much, much more convenient for me. As it stands, I can't really use it as my sole analytics software because I can't be sure it won't double-count if I run it on an old logfile.

I'm really bad with C but this is so essential to me that I'd be willing to try writing a PR, if someone could point me in the right direction.

@skorokithakis

By the way, if you're worried about logs being processed out of order (i.e. someone wanting to process the most recent log first), there are two ways it can be addressed:

  1. Goaccess can refuse to process items between its bounds. For example, if you have log lines from Jan 1 to Jan 30, you can refuse to process log items between those dates again. However, this might be an issue with people processing the Jan log, then the May log, then the Feb log, etc.
  2. A more flexible solution would be to just make this an option. Passing --idempotent or --update-only, goaccess will refuse to process log lines between dates it has already seen.

This (together with --process-and-exit) will allow people to set goaccess to update its logs during the day, from the same log, without worrying about duplicate items. We can simply add a cron job a few times a day that will read the latest logfile, and, whenever we want to see statistics, we can open goaccess at any point (possibly parsing the latest logfile again) to see completely up-to-date statistics, going back months or years, without needing to keep or reprocess the logs, or worry about getting duplicate data.
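
For illustration, the kind of crontab entry this would make possible (log path, schedule and log format here are just placeholders):

# hypothetical cron entry: fold new log lines into the persisted dataset every 6 hours
0 */6 * * * goaccess -f /var/log/nginx/access.log --log-format=COMBINED --load-from-disk --keep-db-files --process-and-exit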

This is the use case I think most people want from goaccess, as it's how Google Analytics and other software works. I'm also very excited about being able to finally ditch GA for a non-intrusive (and more accurate) solution.

@allinurl
Owner

allinurl commented Jan 12, 2017

Thanks for your input. Here are my thoughts on this:

1- Add the ability to process multiple logs directly from goaccess. e.g., goaccess -f access.log -f access.log.1. This will give greater flexibility and avoid having to use zcat/cat for multiple files.

2- To process log files (no piped data) incrementally, we keep the inodes of all the files processed (assuming files will stay on the same partition) along with the last line parsed of each file and the timestamp of the last line parsed. e.g., inode:29627417|line:20012|ts:20171231235059

If the inode does not match the current file, parse all lines. If the current file matches the inode, we then read the remaining lines and update the count of lines parsed and the timestamp. Also, as you suggested and as an extra precaution, add a flag that, if passed, makes it skip log lines with a timestamp less than or equal to the one stored.
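
To make (2) concrete, here is a rough shell sketch of that bookkeeping for a single log file, using the inode:|line:|ts: record above (the state-file path and field extraction are assumptions, not the actual implementation):

#!/bin/bash
LOG=/var/log/nginx/access.log                         # assumed log path
STATE="$HOME/.goaccess.state"                         # hypothetical file holding inode:..|line:..|ts:..

inode=$(stat -c %i "$LOG")
total=$(wc -l < "$LOG")
last_ts=$(tail -n 1 "$LOG" | grep -o '\[[^]]*\]')     # bracketed timestamp of the last line

if [[ -s $STATE ]] && [[ $(cut -d'|' -f1 "$STATE") == "inode:$inode" ]]; then
  prev=$(cut -d'|' -f2 "$STATE" | cut -d: -f2)        # lines already parsed
  tail -n +"$((prev + 1))" "$LOG"                     # same file as last run: emit only new lines
else
  cat "$LOG"                                          # unknown/changed inode: emit everything
fi
printf 'inode:%s|line:%s|ts:%s\n' "$inode" "$total" "$last_ts" > "$STATE"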

3- Piped data is a bigger problem. It will have to be based on the flag mentioned above and the timestamp of the last line read. For instance, it will parse and discard all incoming entries until it finds a timestamp greater than the one stored (should it be inclusive?).

The issue here is that you could have multiple consecutive lines with the same timestamp (even at the second level), so a few may end up as duplicates. However, I assume that in most cases, for incremental log parsing, people would be parsing the logs as they are without pre-processing them and thus #1 would come in handy.

Unless I'm missing something, I think this would be a reasonable implementation that should address most of the problems of duplicate data for incremental parsing.

Thoughts?

@skorokithakis

I completely second the above. I like the inode optimization, although I think it's important that the timestamp-aware exclusion be added, as sometimes logrotate might copy files and the inode will change, so duplicates should be avoided in that case.

I also think the timestamp comparison needs to be exclusive of the last time, as I would rather miss a few lines than double-count a few, but I don't feel very strongly about this.

Keep in mind that, if you accept multiple files in the command line, people will expect that they will all be live-updated. This might not be an issue, I'm just pointing it out.

Overall, this feature sounds great to me. I want to write a post about the inaccuracy of analytics and how GoAccess/Piwik can solve many of the problems at once, but I'd like GoAccess to be a bit more usable for the "set it and forget it" crowd before I recommend it. This feature would get me all the way there, and I could then show people how to set up GoAccess so they always get live and historical data at once.

Thank you!

@allinurl
Owner

@skorokithakis It looks like unless copytruncate is used, logrotate just renames the file. If it just moves the file within the same file system, the rename is only a metadata change, so it should preserve the inode number and work as expected :)
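
A quick way to confirm that on a given box (the temporary file names are just for illustration):

# a rename within the same filesystem is only a metadata change, so the inode is preserved
touch /tmp/demo.log
stat -c %i /tmp/demo.log         # note the inode number
mv /tmp/demo.log /tmp/demo.log.1
stat -c %i /tmp/demo.log.1       # prints the same inode number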

I'll look into this and post back. Thanks again and stay tuned!

@skorokithakis

Is there any update on this? I'd really like to write that post on using GoAccess for analytics, but without this feature the UX is too hard...

@allinurl
Owner

@skorokithakis From my previous comment, the first part (1) is done, and will be deployed soon in the upcoming version. 2 and 3 will probably come in the release after. I want to leave v1.1 as stable as possible before moving into 1.2. Stay tuned :)

@skorokithakis

Ah, alright, thank you!

@SeLLeRoNe

I'm interested in this too :)
I'm just replying to receive updates.

Thanks

@chennin

chennin commented Jun 10, 2017

I just installed 1.2-1 from the deb.goaccess.io repo and am hitting this bug. skorokithakis' use case in the Jan 9 comment is my use case:

  • Parse existing logs once (some compressed) to HTML
  • Update the HTML every X hours
  • Not live updates

What's the current workaround for this? Hack up a rotate & parsing schedule?

@allinurl
Owner

allinurl commented Jun 10, 2017

@chennin I'm working on this on a local branch while keeping up with some bug reports. However, a workaround can be easily scripted in bash, something like this should help for now:

#!/bin/bash

# change the following two...
LOG=/var/log/nginx/access.log
LASTREAD=/home/user/.goaccess.last

# total number of lines currently in the log
OFFSET=$(sed -n '$=' "$LOG")
START=1
# resume right after the last line number recorded on the previous run
if [[ -s $LASTREAD ]]; then
  START=$(<"$LASTREAD")
  START=$((START + 1))
fi

# print only the lines that have not been processed yet,
# then remember how far we got for the next run
sed -n "${START},${OFFSET}p" < "$LOG"
echo "$OFFSET" > "$LASTREAD"

then you can simply run it as:

./myscript.sh | goaccess - --load-from-disk --keep-db-files --log-format=COMBINED

The script will basically read from the first unprocessed line of your log through to the end of it, then save the last line number read into the .goaccess.last file, which will be used as the starting point for the next run. Feel free to modify it to suit your env :)

@FedericoCeratto

Perhaps goaccess could be made smart enough to understand that all the files named foo.log[.<number>[.gz]] belong to the same source (instead of passing -f multiple times) and scan older files only if needed, based on their mtime.

However, the whole point is to feed goaccess old data on its first run. Maybe this can be done externally with a script like #926
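
Until goaccess can do that itself, feeding the rotated set in mtime order for that first run can be scripted externally, e.g. (paths are assumptions):

# oldest first by mtime, decompressing rotated .gz files on the fly,
# to seed the on-disk store with historical data in one go
ls -tr /var/log/nginx/access.log* | while read -r f; do
  case "$f" in
    *.gz) zcat "$f" ;;
    *)    cat "$f" ;;
  esac
done | goaccess - --log-format=COMBINED --keep-db-files -o report.html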

Related #925

@allinurl
Owner

allinurl commented May 9, 2018

@freemp This is awesome, thanks for sharing it!

allinurl added a commit that referenced this issue May 10, 2020
…ring data.

See #334 for double counting
protecting on persisted data.
@allinurl
Owner

This has been added to v1.4 (implemented as stated here) - finally. This release is around the corner, so stay tuned next week. Thanks again for all the feedback.
