-
Notifications
You must be signed in to change notification settings - Fork 7
Ouissal/tutorials babi6 #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Some general comments on the text in the preprocessing notebook:
|
Please double check Tiffany's original comments and incorporate these as there are several which are not addressed |
" print(\"the minimum size is: \" + str(min_size))\n", | ||
" \n", | ||
" # Randomly sample from each group to balance\n", | ||
" balanced_positive = random.sample(positive_items, min_size)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be seeded?
"TRAINING_SAMPLE_SIZE = 120\n", | ||
"VALIDATION_SAMPLE_SIZE = 30\n", | ||
"TEST_SAMPLE_SIZE = 30\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You define these here, but I don't think they're used later?
In the next tutorial the validation set has 36 items in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. These are redundant variables from a previous version where we tweaked the number of entries.
… as it is still under investigation
We noticed that there is a reproducibility problem in this experiment. While the accuracies obtained are generally good: between 80% and 100%, the results of the experiments are not reproducible once run on the same data despite forcing the same seeding. Here is what we know:
|
…ts to reduce verbosity of the tutorial by simplifiying sentences, rewriting them, or making them shorter. The reproducibility problem persists, and one might notice that some changes to the training part have been added to reflect debugging for this particular problem. Will be removed after the problem is solved.
Please can you do another pass to remove all instances of the word 'subejct' to the word 'person', also applies for 'context' or story which should instead be text for consitency (and also e.g. ctx in the code). And any instances of 'question asking' which remain - e.g For more details about question asking in {term}DisCoCirc, -> for more details on the choice of the implementation of quesitons in discocirc see .... |
"\n", | ||
"After extracting the texts, they are filtered to only keep the ones whose number of sentences is less than or equal to `TEXT_LENGTH`, which we set in the previous cell to determine the maximum number of sentences that we want in a text for better efficiency. This is to make sure that we do not get huge circuits later on when we convert the texts into circuits, which might slow down the experiment. \n", | ||
"\n", | ||
"After this filtering, the last step is to convert the list of texts from a list of arrays of sentences, to a list of sentences. In other words, we concatenate the sentences in each text (which is an array) to obtain a string." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns the texts as strings.
" texts = []\n", | ||
" qnas = []\n", | ||
" text_length = []\n", | ||
" for story in stories:\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the switch from text to story? Also ctx as a variable is outdated refering to context
"id": "b04a8ff9", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that the text circuits are post-processed for optimization, it is time to make the assertion circuits to later sequentially compose the latter with the former. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be said in a simpler way
"We start by creating a layer composed of either identities (to link with the wires corresponding to the question nouns), or discards (for the rest of the wires). Once we sequentially compose this layer with the text circuit, this leaves us with a circuit whose codomain has two wires corresponding to the question nouns. In order for us to attach the assertion boxes, we have to make sure that the wires from the assertion circuits are linked to the right wires from the text circuit. To achieve this, we check the question ids of the wires in the text circuits (to see whether the nouns in the text circuits are in the right order). This helps us decide whether to use the assertion boxes that come with swaps, or the ones without swaps (if the question wires are in the wrong order, we would need a swap to bring them back to the right order for the questions. Remember, we already created assertion boxes that are also equiped with swaps for this purpose).\n", | ||
"\n", | ||
"Notice that, throughout the next cell, we always have two circuits. The circuit names ending in \"pos\" signal the circuits corrsponding to the affirmative assertions, while their counterparts ending in \"neg\" signal the ones corresponding to the negative assertions.\n", | ||
"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make more sense to use 'aff' for affiramtive instead of 'pos'?
I think the three paragraphs can be shortened to a simple statement saying that the wires have to match. If the location comes before the person in the text circuit then a swap is needed for the question to be composed to the correct wires.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Should I delete the paragraph that explains the implementation details (the discards and the swaps), and just keep it simple by saying that the wires have to match and they can come in any order so we have to provide boxes that accommodate both possible cases?
…ies and the one on reqriting the three paragraphs justifying the swaps
Hi @ouissal-moumou, does this PR supersede the previous PR #42. I see that both refer to using the experimental DisCoCirc module on the bAbI6 task? EDIT: Closed the other ticket as all work done is on this PR now. |
…s have been addressed + more changes to wording
…the works since there seems to be a problem with Lambeq
Further changes have been made to the data-prep part of the tutorial both in terms of the text and code. The training notebook is still under investigation since it seems there is a bug in Lambeq. In a nutshell, it seems that reproducibility is only possible when we make sure that the training dataset is unshuffled. |
"Now that we already have the circuits representing the texts, we need to make the circuits representing the assertions. Remember, in our experiment, we need to have a pair of circuits, one for the affirmative case, and the other for the negative case. However, when adding the box corresponding to the assertion, we have to make sure that the wires of the assertion box match with the wires representing the nouns from the text. \n", | ||
"\n", | ||
"Below, the function `return_noun_list` returns all the nouns in a text. The function `return_q_nouns` return all the nouns in a question. In the latter, we take the third and sixth word as the person and location in the question respectively. This works because of the simple case of the bAbI6 experiments, all the questions are of the format \"Is the person in the location?\".\n", | ||
"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the dataset has Questions e.g "Is Emily in the kitchen?" so slight rephrasing needed with no 'the' in front of 'person' for the question 'Is person in the location? ' and then we have second and fifth words of the question. This also corresponds to the code using index 1 and 4.
The bAbI6 Tutorial with DisCoCirc
This PR introduces the bAbI6 tutorial. Please add any comments or questions if you have.