
[Draft] Add bayesian training #2430

Open
seemanne wants to merge 15 commits into master

Conversation

seemanne
Member

No description provided.

@github-actions

@seemanne: There is no 'kind' label on this PR. A 'kind' label is needed to generate the release automatically.

  • /kind feature
  • /kind enhancement
  • /kind fix
  • /kind chore
  • /kind dependencies
Details

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

@github-actions

@seemanne: There are no area labels on this PR. You can add as many areas as you see fit.

  • /area agent
  • /area local-api
  • /area cscli
  • /area security
  • /area configuration
Details


@seemanne seemanne changed the title add basic trainer class Add bayesian training Aug 16, 2023
@seemanne seemanne changed the title Add bayesian training [Draft] Add bayesian training Aug 16, 2023
@seemanne
Member Author

/kind feature
/area configuration

@codecov

codecov bot commented Aug 16, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (92f923c) 56.94% compared to head (ba7a5a3) 37.79%.

Files                         Patch %   Lines
pkg/leakybucket/bayesian.go   0.00%     4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2430       +/-   ##
===========================================
- Coverage   56.94%   37.79%   -19.16%     
===========================================
  Files         195      191        -4     
  Lines       26675    26302      -373     
===========================================
- Hits        15191     9940     -5251     
- Misses       9901    15018     +5117     
+ Partials     1583     1344      -239     
Flag           Coverage Δ
bats           37.79% <0.00%> (ø)
unit-linux     ?
unit-windows   ?

Flags with carried forward coverage won't be shown. Click here to find out more.


Comment on lines 122 to 264

for _, v := range s.ParsedIpEvents {
go evaluateProgramOnBucket(&v, compiled, inputChan)
}

go controllerRoutine(inputChan, outputChan, s.total)
Contributor

@sbs2001 sbs2001 Aug 17, 2023

I’m not aware of the larger context of where this fits in. It looks like you’re adding a new function in scenarios/parsers. That said, I think using goroutines here won't add performance and only makes the code more complex. The goroutined function evaluateProgramOnBucket doesn't do any I/O work, so it doesn't make sense to run it in goroutines. Something like

var result evalHypothesisResult
for _, v := range s.ParsedIpEvents {
    r := evaluateProgramOnBucket(&v)
    // update result with r
}

would have equivalent if not better performance while eliminating the overhead of the controller routine, channels, etc.

Member Author

Ok, maybe I should explain. This is the main training loop for the bayesian buckets. The construction works the following way:

  • The logs are loaded into the LogEventStorage into a map keyed by IP, and all the events for that IP are added to its fakeBucket.
  • The user can then test different hypothesis exprs to see if any of them would make good conditions in the bucket, using TestHypothesis.
  • To speed up the hypothesis testing, the idea is to run it in parallel threads for each IP (as it's basically counting some stuff per IP).
  • The goal of the channel/goroutine design in TestHypothesis is to enable this parallelism by spawning an independent routine for each IP (fake bucket) and then collecting all the results using the controller.

Does this make more sense now?

Contributor

Understood, the goroutines would definitely increase throughput. If the training is a CPU-intensive task, then goroutines make sense.

Member Author

Nice, thank you
