# Animorphs: The Lost Chapters [Completed] #66
## Dev Diary

### Obtaining the corpus
```js
// runner.js
const fs = require("fs");
const pdf = require("pdf-parse");

// readdirSync is synchronous and returns the file names directly;
// it does not take a callback.
const fileNames = fs.readdirSync("./PDF");
fileNames.forEach((file) => {
let targetInput = `./PDF/${file}`;
let targetOutput = "./texts/" + file.substring(0, file.length - 4) + ".txt";
let dataBuffer = fs.readFileSync(targetInput);
pdf(dataBuffer)
.then((data) => {
fs.writeFile(targetOutput, data.text, (err) => {
if (err) throw err;
console.log(`${targetOutput} has been created!`);
});
})
.catch((e) => {
console.log(e);
});
});
```

### Obstacle

The code would not execute from the command prompt. The likely culprit: `fs.readdirSync` does not take a callback, so the original callback-style call never produced a usable file list. Calling it without one, as above, returns the names directly.
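With that fix, running the script (assuming Node.js and the pdf-parse package are installed) looks like:

```sh
$ npm install pdf-parse
$ node runner.js
```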
```sh
$ cat texts/*.txt > animorphs-corpus.txt
```

And that's all the code we have to write for now! Let's get a word count:

```sh
$ wc -w animorphs-corpus.txt
1725941 animorphs-corpus.txt
```

1.7 million words! The file size is 9.2M. We will now take this and feed it to GPT-2.
## Fine-tuning GPT-2

This tutorial is pretty spot on, and I've used it before to train on other data sets (taco-related, poetry, Gothic lit). The main thing is that I want to ensure my generated text has the right "flavor," so I chose to tune for 2000 steps, twice the recommended number. Training for 1000 steps takes about 45 minutes, so 2000 will take about an hour and a half.
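For reference, the fine-tuning call looks roughly like this; a minimal sketch assuming the small 124M base model (the tutorial's exact settings may differ):

```python
import gpt_2_simple as gpt2

# Fetch a base model and fine-tune it on the Animorphs corpus for 2000 steps.
gpt2.download_gpt2(model_name="124M")
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="animorphs-corpus.txt",
              model_name="124M",
              steps=2000)
```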
### Sample output from Step 1850
## gpt-2-simple documentation

Take advantage of the "cloud" resources available while working on this machine learning project. Procedural text generation is a wonderful gateway into Natural Language Processing. So check out the documentation; you could possibly use the package elsewhere!
### Findings from Documentation
## First GPT-2 Output

Okay! Let's make some invocations. I will use the opening words of the very first book, "My name is", as the prefix.

### Input Parameters

```python
gpt2.generate(sess,
              length=1024,
              temperature=0.7,
              prefix="My name is",
              nsamples=5,
              batch_size=5
              )
```

### Output Results
## Analysis of First GPT-2 Output

Let's grab some statistics from a free online Word Counter Tool.

### Statistics

- Keyword Density x1
- Keyword Density x2
- Keyword Density x3
## Second GPT-2 Output

This time, I will not give it a prefix prompt, and will let it write freely on its own.

### Input Parameters

```python
gpt2.generate(sess,
              length=1024,
              temperature=0.7,
              # prefix="Chapter ",
              nsamples=5,
              batch_size=5
              )
```

### Output Results
## Analysis of Second GPT-2 Output

Let's grab some more statistics, again from our favorite free online Word Counter Tool.

### Statistics

- Keyword Density x1
- Keyword Density x2
- Keyword Density x3
## Scaling Up

In the previous parameter inputs, asking for `nsamples=5` produced only five samples per run. By batch generating 85 samples, we can approach the 50,000-word goal in a single run.
## Getting to 50,000 words

GPT-2 has recognized "Chapter X" as part of its output. So we will add a little snippet to procedurally generate a chapter title and use it as a delimiter instead of the default `====================` line.
Let's modify the batch generation parameters to fit our needs now.

```python
import random
from datetime import datetime

gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())
chapter_title = 'Chapter ' + str(random.randrange(31))

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=1023,
                      temperature=0.7,
                      nsamples=85,
                      batch_size=20,
                      sample_delim=chapter_title
                      )
```
## Debugging

### Error

```
AssertionError                            Traceback (most recent call last)
<ipython-input-11-de1a1b92f5a6> in <module>()
     11     nsamples=85,
     12     batch_size=20,
---> 13     sample_delim=chapter_title
     14     )

1 frames
/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py in generate(sess, run_name, checkpoint_dir, model_name, model_dir, sample_dir, return_as_list, truncate, destination_path, sample_delim, prefix, seed, nsamples, batch_size, length, temperature, top_k, top_p, include_prefix)
    426     if batch_size is None:
    427         batch_size = 1
--> 428     assert nsamples % batch_size == 0
    429
    430     if nsamples == 1:

AssertionError:
```

### Cause

`nsamples=85` is not evenly divisible by `batch_size=20`, so the assertion on line 428 of gpt_2.py fails.
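A quick check of the arithmetic:

```python
>>> 85 % 20  # the remainder is nonzero, so the assert trips
5
```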
### Solution

Ask for 80 samples instead of 85, so that `nsamples` divides evenly into batches of 20.

```python
import random
from datetime import datetime

gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())
chapter_title = 'Chapter ' + str(random.randrange(31))

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=1023,
                      temperature=0.7,
                      nsamples=80,
                      batch_size=20,
                      sample_delim=chapter_title
                      )
```
## First Attempt at 50,000 words

### Download the batch generation

We fixed the bug! Now for the moment of truth.

```python
# files comes from Colab's helper module
from google.colab import files
files.download(gen_file)
```

### Unix command line to find Word Count

How many words do we have?

```sh
$ wc -w gpt2_gentext_20201127_064236.txt
49266 gpt2_gentext_20201127_064236.txt
```

We are just 734 words short! Let's create one more generation with a single sample:

```python
# previous code, but with:
nsamples=1,
batch_size=1,
# then
files.download(gen_file)
```

```sh
$ wc -w gpt2_gentext_20201127_065328.txt
750 gpt2_gentext_20201127_065328.txt
```

750! Nice. Now we have over 50,000 words, but they are in two files. So let's concatenate them into one file. We used a similar strategy earlier, with a glob, so we'll do it again here.

```sh
$ cat gpt2*.txt > animorphs-the-lost-chapters.txt
$ wc -w animorphs-the-lost-chapters.txt
50015 animorphs-the-lost-chapters.txt
```

We did it!!!
## Analysis of the First Draft
And skimming through all of the instances of the word "Chapter", there are 156 chapters!
Read the first draft here!
There are a few more days before this ends, so I will try to think of anything else I'd like to do. One thing that stands out is generating additional text in order to "curate" a better novel: the current first draft has chapters with a lot of repetitive text content. Ideally, I could delete those sections, generate more text, and append/edit it.
There are almost 5000 lines of text in this novel. It shouldn't take too long to scroll through and "prune" them.
But I would also be interested in generating more text.
Maybe I'll run some NLP tools on the original corpus and compare the results to what GPT-2 produced.
Or even have GPT-2 generate a million-word corpus to rival the source material! The originals total 1,449,150 words according to Reddit, and 1.7 million by my shell scripts, so around one and a half million words. We would need an `nsamples` of about 2500 to get there.
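Rough arithmetic behind that `nsamples` figure, assuming roughly 600 words per 1023-token sample (about what the first draft averaged across its samples):

```python
>>> 1_500_000 / 600
2500.0
```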
I noticed that I was missing a newline character in my `chapter_title`:

```python
chapter_title = 'Chapter ' + str(random.randrange(31)) + '\n'
```

I will use it to generate the one-million-some words:

```python
nsamples=2500,
batch_size=20,
```
I would also consider re-training GPT-2 to recognize chapters by labeling each chapter in the corpus data set.
## Second Draft Considerations

There were a couple of things in the first draft that I think can be improved.
## Mid Level Design

### Preparing the raw data for processing

The biggest task will be to label all of the book chapters with a "bespoke character sequence to indicate the beginning and end of a document."

The token sequences will be `<|startoftext|>` and `<|endoftext|>`.
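A sketch of what one labeled chapter might look like (the delimiter tokens are gpt-2-simple's convention; the chapter text here is just a stand-in):

```
<|startoftext|>
Chapter 1
My name is Jake. ...
<|endoftext|>
```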
After labeling the data, I will start a new fine-tuning instance on the labeled corpus.

### BONUS: Cross-training the deep learning model with fanfiction

According to another Reddit post from the Animorphs subreddit, there is a fanfiction with a word count surpassing the first 13 books. Animorphs: The Reckoning is a fanfic with 64 chapters and 545,625 words! Over half a million words; that's almost 11 extra novels! This would be a cool "stretch goal": "cross-train" the main corpus with a giant fanfic and see what the output looks like!
### Easter Egg

A Letter to the Fans by K.A. Applegate
I started training a new GPT-2-Simple instance with this labeled data set at 1:00pm. I set it to 2000 steps, so it should be done around 2:30pm!
## Introducing More Variables
### Obstacle

I walked away from my computer to let it work idly. It went to sleep and terminated its connection to the Google Colaboratory VM. So I'm changing my computer settings to NOT DO THAT and trying again. It's risky asking for 2000 samples all at once, so I will ask for 4 batches of 500 instead.
### Obstacle

My internet connection has become very spotty and I keep disconnecting from the runtime, so I will now try using smaller batches.

### Approach

I am just going to ask twice. I already have a million words from the first iteration (didn't post about this earlier), and asking twice has yielded me 141,627 words to curate from!
## First Impressions on Data Collection

We have 141,627 words generated in a 14,870-line document!

```sh
$ cat ani2-raw.txt | grep 'Chapter' | wc -l
352
```

And it looks like there are 352 chapters.

### Labeling the Initial Data Produced Better Results

Pretty much this entire document is arranged into chapters! Looking good! I think a programmatic approach to curation is the next call.

### High level design

I can do step 1 in my text editor, with a find-and-replace-all.
## Organizing the Data

We're going to use Node.js again for file manipulation.

```js
// chapterize.js
const fs = require("fs");

// Read synchronously so the data is ready before we split it; the async
// fs.readFile callback version left `a` undefined at split time.
const a = fs.readFileSync("ani2-cleaned.txt", "utf8");

const chapters = a.split("Chapter");
chapters.forEach((chapter, i) => {
const chapterData = `Chapter${chapter}`;
let fileName = `ani2-${i + 1}.txt`;
let targetOutput = `./gen_chapters/${fileName}`;
fs.writeFile(targetOutput, chapterData, "utf8", (err) => {
if (err) throw err;
console.log(`${fileName} was created.`);
});
});
console.log(`${chapters.length} files created.`);
```

Make the output directory first:

```sh
$ mkdir gen_chapters
```

### Obstacle

The code would not execute from the command prompt. The likely culprit was the asynchronous `fs.readFile` callback: it hadn't run yet when `a.split` was called, so `a` was still undefined. Reading synchronously, as above, fixes it.

### Results

Now all of the chapters are organized into 353 separate files!

### Next Steps

I can programmatically curate 50,000 words from these files. I need to come up with a strategy for doing that.
## Filtering the Data

I want to know the distribution of "Chapter Numbers" in the 353 generated files.

```sh
$ cat ani2-unraw.txt | grep 'CHAPTER' | cat > chapters.txt
```

After a little text editor action to clean the data, let's use some more JavaScript to analyze the chapter distribution!

### Analyzing Generated Chapter Distributions

```js
// chapter-counter.js
const fs = require("fs");

// Read synchronously; returning data from the async readFile callback
// does not assign it to `a`.
const a = fs.readFileSync("chapters.txt", "utf8");
const chapterNumbers = a.split("\n");
const countOccurrences = (arr) =>
arr.reduce((prev, curr) => ((prev[curr] = ++prev[curr] || 1), prev), {});
console.log(countOccurrences(chapterNumbers));
```

```sh
$ node chapter-counter.js
```

### Results

```
{
'1': 5,
'2': 1,
'3': 16,
'4': 10,
'5': 29,
'6': 16,
'7': 9,
'8': 18,
'9': 19,
'10': 10,
'11': 11,
'12': 39,
'13': 14,
'14': 17,
'15': 15,
'16': 18,
'17': 12,
'18': 23,
'19': 11,
'20': 8,
'21': 21,
'22': 9,
'23': 7,
'24': 3,
'25': 2,
'26': 3,
'27': 3,
'35': 1,
'39': 2,
'': 1
}
```

So there is a nice distribution! It generated Chapters 1-27, plus some odd ones at 35 and 39, and there's one "orphan" entry with no number at all. Of interesting note, there is exactly ONE `Chapter 2`.
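As a quick sanity check (a hypothetical snippet, not part of the original pipeline), the occurrence counts sum back to 353, matching the number of generated chapter files:

```js
// sum-check.js: total the chapter-number occurrences from the results above
const counts = {
  '1': 5, '2': 1, '3': 16, '4': 10, '5': 29, '6': 16, '7': 9, '8': 18,
  '9': 19, '10': 10, '11': 11, '12': 39, '13': 14, '14': 17, '15': 15,
  '16': 18, '17': 12, '18': 23, '19': 11, '20': 8, '21': 21, '22': 9,
  '23': 7, '24': 3, '25': 2, '26': 3, '27': 3, '35': 1, '39': 2, '': 1,
};
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
console.log(total); // 353
```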
Can I submit a pull request to subtitle this *10,000 Bowls of OAT-freaking-MEAL*?
Haha. I had to look up the passage.
I think that would be a cool little blurb on the book cover :) Something like:
## Summarizing the Data

Remember that we have 353 generated chapters organized into a folder (recall `targetOutput` from chapterize.js):

```js
let targetOutput = `./gen_chapters/${fileName}`;
```

### Get the word count of each chapter

Let's loop over every file in that directory and get a word count from each of them.

```sh
$ for i in gen_chapters/*.txt; do wc -w $i; done
457 gen_chapters/ani2-10.txt
227 gen_chapters/ani2-100.txt
278 gen_chapters/ani2-101.txt
...
```

Very nice! Let's sort this with a pipe operator! The `-r` flag reverses the sort, and adding `-n` makes it numeric so multi-digit counts order correctly:

```sh
$ for i in gen_chapters/*.txt; do wc -w $i; done | sort -rn
889 gen_chapters/ani2-259.txt
849 gen_chapters/ani2-332.txt
802 gen_chapters/ani2-141.txt
...
7 gen_chapters/ani2-148.txt
2 gen_chapters/ani2-241.txt
1 gen_chapters/ani2-1.txt
```

Wow! This is a really cool spread! Let's save it into a file.

```sh
$ for i in gen_chapters/*.txt; do wc -w $i; done | sort -rn | cat > word-count.txt
```

(Wow, I can't believe that worked!)
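Before the next step, that word-count.txt output needs to become a CSV. A hypothetical one-off converter (assuming each line looks like `457 gen_chapters/ani2-10.txt`):

```js
// to-csv.js: turn "count path" lines into "count,path" rows
const fs = require("fs");

const lines = fs.readFileSync("word-count.txt", "utf8").trim().split("\n");
// Replace the first run of whitespace on each line with a comma.
const csv = lines.map((line) => line.trim().replace(/\s+/, ",")).join("\n");
fs.writeFileSync("word-count.csv", csv, "utf8");
```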
## Getting to 50,000 words

With some text editor work, the word-count.txt output becomes word-count.csv.

Let's write another JavaScript program that will:

- parse the CSV of word counts,
- accumulate chapters until the total passes 50,000 words, and
- print a shell command that copies a shuffled selection of those chapters into a book.

We will use the `csv-parse` package to read the CSV.
```js
// word-counter.js
const fs = require("fs");
const parse = require("csv-parse");
// https://stackoverflow.com/questions/2450954/how-to-randomize-shuffle-a-javascript-array
const getShuffledArr = (arr) => {
if (arr.length === 1) {
return arr;
}
const rand = Math.floor(Math.random() * arr.length);
return [arr[rand], ...getShuffledArr(arr.filter((_, i) => i != rand))];
};
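
// Walk the CSV records, accumulating word counts until we cross 50,000,
// then print a shell command that assembles the chosen chapters.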
var wordCounter = 0;
var chapters = [];
const counter = (err, records) => {
for (let [index, record] of records.entries()) {
let wordCount, chapter;
[wordCount, chapter] = record;
wordCounter += parseInt(wordCount);
chapters.push(chapter);
if (wordCounter >= 50_000) {
const message =
`Task complete! ` +
`There are ${chapters.length} chapters and ${wordCounter} total words.` +
`\n` +
`One last step! Get the book by using this command in the terminal: ` +
`\n` +
`\n` +
`cp ${getShuffledArr(chapters).join(" ")} lost_chapters/ ` +
`&& cat lost_chapters/*.txt > animorphs-the-lost-chapters-final.txt`;
console.log(message);
break;
}
}
};
const parser = parse(counter);
fs.createReadStream("word-count.csv").pipe(parser);
```

A little bit of meta-programming; I wrote code to write code for me! Every time this script is run, it will generate a differently ordered output.
I definitely ran that script several times to see the different outputs. Woo! Let's use that new command now!
## Running the Code

The input took two minutes to buffer into the prompt! Note to self: the next time I try to generate code for the shell, just write it into a script file.

### BUG (fixed)
I was missing a `>` in the generated `cat` command, so the book streamed straight to the terminal instead of into a file.

WOW! That was instant!!! Check out those last sentences!
## Animorphs: The Lost Chapters

It is now complete!
Congratulations! Is the code in the comments here, or in a repo? (Fine either way, just checking :)
The code is all in here, in the style of "stream of consciousness."
Good stuff, have a ⭐!
### Content warning

Animorphs is a war story full of tragedy and trauma. There are graphic depictions of bodily injuries and mutations.

Images generated with Text to Image API.
# Animorphs: The Lost Chapters

A 50,500-word novel generated with gpt-2-simple, finetuned on the entire Animorphs book series (2000 steps).

### Excerpt

Read it here

Download it

Here are some initial statistics:

- Keyword Density x1
- Keyword Density x2
- Keyword Density x3
I'll be back next year!
Next I will be working on mining the novel for data visualization. This is a little bit outside the scope of NaNoGenMo, so for now I will just include some initial statistics above. This was a fun project! I've been looking forward to this for months!
## Submission [Completed]
I'm making an entry before the month is out! I've been looking forward to this for a long time. :)
I will use the Animorphs corpus of over one million words written by K.A. Applegate and various ghostwriters to train GPT-2 and ask it to write me a novel.
I'm using a mix of JavaScript, Python, and shell scripting to accomplish this task.
High-level design overview
Read previous drafts here!
- First Draft
- Second Draft