Quartus streaming support for Activations, Dense & Batch Normalization #557
Can you expand a bit on why multiple includes are needed (e8c9170). It seems a bit strange to me to include the same header file twice. |
@bo3z, can you fix the conflicts? I understand that it's high priority to get this in fairly soon, so I'd like to make a pr/557 to run all the tests. |
Unlike Vivado, Quartus generates look-up tables through Python and stores them in .h files. Each of these files contains a single C array holding the entries of the LUT. The header files are included only in the function that uses the LUT, not globally. I am aware that this is very bad practice, especially in a big project. One alternative is generating the same LUT twice, once with the suffix parallel and once with the suffix stream. This will slow down GCC compilation, but should have no impact on HLS synthesis, as the HLS compiler removes all unused code. Using a linker ( |
Rebased and pushed. All the tests should pass now. |
About the multiple includes, I think it's fine. The includes are used inside of functions, so effectively define the variables there, if I understand correctly. |
I see the following errors when I run pytests:
|
(In general, though, the PR looks good.) |
Could you post the full CI/CD log please? Or maybe re-run the tests ? I just re-ran both of these tests locally and they passed. |
That's correct. |
They both passed. I don't understand what happened before. They had previously failed by not matching perfectly. I have to run now, but I want to push your branch to pr/557 on the upstream repository to trigger the pytests as part of the CI (you can do it if you want) and I'll look over the code one more time before approving. |
No clue what happened there, but I've seen this happen once before as well. Sometimes the CI just fails. I will push to pr/557 now, all tests should pass. I've just merged main into this branch. |
This PyTest still seems to be failing, by a very small margin. However, this test defaults to Vivado, so I'm not sure why that's the case? |
I wonder if it's a different random seed. |
Looks amazing. Can you fix the two minor cosmetic issues before we merge?
…to avoid generating same LUT twice.
All comments addressed. Needed to rebase one final time to avoid merge conflicts as well as include a small amount of code from a different branch in It might be worth including in the CI/CD process a step for synthesising Quartus designs, as we don't currently support that. |
Vladimir should check his change requests first, but from my side things look good.
I'm fine with it. No idea about the test failure. It seems to choose the same seed every time for this PR, which is strange. I'll approve it, not sure if we should merge, or investigate the test more? For sure no changes in this PR cause the test to fail. |
I would maybe merge it (just to avoid further merge conflicts and kick off the work on streaming CNNs) and see if this test fails on other PRs, rebased on top of this one? |
Let's go!!! |
A PR enabling support for io_stream in the Quartus backend. Currently, the supported layers for streaming I/O are Dense, Activation and Batch Normalisation. Due to the inherent differences between parallel and streaming interfaces in Intel HLS, this is an extensive PR. Furthermore, Intel's HLS stream has certain requirements when implemented (no copy constructor, global declaration etc.). Therefore, it is recommended to review this PR commit by commit (rather than side-by-side diff), with each commit being briefly explained below. Each commit is self-contained and can be checked out and the project compiled.
2233a68 Allow multiple includes of the same LUT - This simply allows the same look-up tables to be used for both io_parallel and io_stream, by removing the IFNDEF guard
e003234 Quartus custom Stream & distinction between g++ (nnet::stream) and i++ (ihc::stream):
hls4ml.compile() or hls4ml.predict(...), GCC is used. However, GCC doesn't have access to HLS source files (which include Intel's streaming interface, ihc::stream, ihc::stream_in, ihc::stream_out). Unlike ac_int and ac_fixed, which are open-source and included in hls4ml, Intel's HLS stream source files are protected by licence and cannot be included in the repository. Therefore, a custom nnet::stream struct is written, having the same high-level function as Intel's HLS ihc::stream, but implemented using queues. Please note, pytests written for streaming layers use nnet::stream. To verify correct functionality of the IP, use cosim with the following command: i++ -march=x86-64 -o myproject_test -v myproject_test.cpp firmware/myproject.cpp
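The idea of a queue-backed stand-in for the vendor stream can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the namespace sketch, the class layout and the member names read, write, empty and size are chosen for the example, and the real nnet::stream may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

namespace sketch {

// Hypothetical software-only stream; NOT the hls4ml or Intel implementation.
template <typename T>
class stream {
  public:
    stream() = default;

    // Mirror the restriction described above: a stream cannot be copied.
    stream(const stream &) = delete;
    stream &operator=(const stream &) = delete;

    // Append one element to the back of the FIFO.
    void write(const T &value) { fifo_.push(value); }

    // Remove and return the oldest element.
    // The caller is responsible for not reading from an empty stream.
    T read() {
        assert(!fifo_.empty());
        T front = fifo_.front();
        fifo_.pop();
        return front;
    }

    bool empty() const { return fifo_.empty(); }
    std::size_t size() const { return fifo_.size(); }

  private:
    std::queue<T> fifo_; // plain std::queue, so this compiles with g++
};

} // namespace sketch
```

Because a type like this only models first-in, first-out ordering, it says nothing about hardware timing, which is why the description asks for cosim in addition to PyTest.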
void - instead of returning a stream, it takes the stream object by reference. It is not even possible to return a stream from a component, since the internal implementation contains an explicitly deleted copy constructor.
stream_in, outputs of type stream_out and inter-component connection of type stream. With this distinction, the HLS compiler is able to distinguish between component inputs and outputs. If only stream was used, the component would not synthesize with the correct inputs/outputs
io_parallel and io_stream in myproject.cpp, myproject.h and defines.h - in parallel, the top-level component takes an input struct, containing the array of data, and returns a struct containing the output array. On the other hand, as explained above, streaming interfaces are void, as both the input and output are passed by reference.
io_parallel), all of the input data is processed, stored to a vector and then passed to the component sequentially. However, due to an explicitly deleted copy constructor, ihc::stream cannot be stored inside a vector. Therefore, this new benchmark processes data from input files and executes the component straight away.
io_stream, inter-layer connections (type stream) must be declared outside of main() - Intel HLS requires that all streams are either passed by reference to the top-level component or declared as global variables. No stream types can be instantiated inside the component.
nnet::array, as an array-like struct for storing data inside streams. Implementation similar to Vivado.
1b17b3b Quartus Clone Optimizer - Sets up streaming optimizers in the Quartus backend and adds the clone optimizer, as the most commonly used. From here, it should be straightforward to add further streaming passes.
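The interface rules listed above (a void top-level function, streams passed by reference, inter-layer streams declared outside the component, and an array-like packet type) can be sketched in plain C++. Everything here is an assumption made for illustration: the namespace demo, fifo_stream, packet_t, layer1_out, relu_stage, scale_stage and myproject_sketch are hypothetical names, and no Intel HLS headers or attributes are used, so this only shows the shape of the code that g++ would see.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

namespace demo {

constexpr std::size_t kWidth = 4; // hypothetical number of features per packet

// Hypothetical array-like packet carried through a stream.
template <typename T, std::size_t N>
struct array {
    T data[N];
    T &operator[](std::size_t i) { return data[i]; }
    const T &operator[](std::size_t i) const { return data[i]; }
};

// Minimal non-copyable FIFO standing in for a stream type.
template <typename T>
struct fifo_stream {
    fifo_stream() = default;
    fifo_stream(const fifo_stream &) = delete;
    fifo_stream &operator=(const fifo_stream &) = delete;
    void write(const T &v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
    std::queue<T> q;
};

using packet_t = array<float, kWidth>;

// Inter-layer connection declared at namespace scope, outside the component.
fifo_stream<packet_t> layer1_out;

// First stage: element-wise ReLU on one packet.
void relu_stage(fifo_stream<packet_t> &in, fifo_stream<packet_t> &out) {
    packet_t p = in.read();
    for (std::size_t i = 0; i < kWidth; i++) p[i] = p[i] > 0.0f ? p[i] : 0.0f;
    out.write(p);
}

// Second stage: multiply every element by two.
void scale_stage(fifo_stream<packet_t> &in, fifo_stream<packet_t> &out) {
    packet_t p = in.read();
    for (std::size_t i = 0; i < kWidth; i++) p[i] = 2.0f * p[i];
    out.write(p);
}

// Top-level function: returns void and takes both streams by reference,
// since a non-copyable stream cannot be returned by value.
void myproject_sketch(fifo_stream<packet_t> &input, fifo_stream<packet_t> &output) {
    relu_stage(input, layer1_out);
    scale_stage(layer1_out, output);
}

} // namespace demo
```

A caller writes one packet into the input stream, calls myproject_sketch, and reads the result from the output stream straight away, which matches the benchmark flow described above where the component is executed as soon as the input is processed.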
d4da71f Tanh bug fix in Quartus - Addresses a small bug when invoking TanH activation. This is simply a prerequisite for writing streaming activations.
The other commits add support for streaming Dense, BatchNormalization and Activation, in a similar manner to Vivado, with the appropriate pipeline initiation interval:
io_stream. Some new tests (test_activations.py) were written as well, to verify correct results of less-frequently used layers. Important to note, a passing PyTest does not imply correct HLS outputs - as explained above, PyTest uses nnet::stream and HLS, when synthesizing, uses ihc::stream - these two are inherently different, even though they offer the same high-level functionality. To verify correct HLS behaviour, cosim should be used (explained above), as well as some RTL-level simulation, such as ModelSim or Questa. All of the above layers were tested using both PyTest and cosim. Finally, all of the above layers synthesized correctly and produced a valid IP block.
#pragma HLS data_pack directive, for which the Intel equivalent would be hls_bankwidth. However, hls_bankwidth (similar to other memory-optimisation pragmas, except for hls_register) is not supported for variables passed by reference/pointers, which is a necessity for correct functionality.