The Quantum Supposition of Oz #137

Open
spc476 opened this issue Nov 30, 2014 · 7 comments

Comments

@spc476

spc476 commented Nov 30, 2014

A Markov chain of order 3 based on the Oz novels written by L. Frank Baum (14 novels in total). The only unusual thing here is that I treated punctuation marks, as well as the end of each paragraph, as "words", so that you don't get a "wall of text" but something that is a bit more readable (even if the punctuation is separated from the neighboring words by spaces when it shouldn't be).
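For readers who want to try the idea, here is a minimal Python sketch of an order-3 chain that treats punctuation marks and paragraph ends as tokens. It is not the code from the linked repository; the `<P>` marker, the function names, and the tokenizing regex (ASCII letters, digits, and apostrophes only) are all assumptions made for illustration.

```python
import random
import re
from collections import defaultdict

PARA = "<P>"  # hypothetical end-of-paragraph token (an assumption, not from the repo)
ORDER = 3     # three tokens of context, as in an order-3 chain


def tokenize(text):
    """Split text into words, single punctuation marks, and paragraph markers."""
    tokens = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # A word is a run of ASCII letters, digits, or apostrophes; any other
        # non-space character becomes a token of its own.
        tokens.extend(re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", para))
        tokens.append(PARA)
    return tokens


def build_chain(tokens, order=ORDER):
    """Map each run of `order` tokens to the list of tokens seen after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain


def generate(chain, length, rng, order=ORDER):
    """Walk the chain for `length` tokens, restarting at a random state on dead ends."""
    keys = sorted(chain)
    out = list(rng.choice(keys))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            out.extend(rng.choice(keys))
    return out[:length]


def render(tokens):
    """Join tokens with spaces and turn paragraph markers into blank lines."""
    paragraphs, current = [], []
    for tok in tokens:
        if tok == PARA:
            if current:
                paragraphs.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Because `render` simply joins tokens with spaces, its output shows the same stray spaces around punctuation that the comment mentions.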

The code is on GitHub: https://github.com/spc476/NaNoGenMo-2014 and the sample novel can be read at https://github.com/spc476/NaNoGenMo-2014/blob/master/TheQuantumSuppositionOfOz.txt

And my blog entry that goes into more detail about how it works: http://boston.conman.org/2014/11/29.1

@spc476 spc476 closed this as completed Nov 30, 2014
@spc476 spc476 reopened this Nov 30, 2014
@cpressey

Nice technique for handling the punctuation -- it does make it more coherent(-seeming) than a run-of-the-mill Markov chain.

It should be possible to clean up the intervening spaces with a postprocessor... I wrote one (here) for my own novel, but admittedly I didn't have quotation marks to deal with.

@ikarth

ikarth commented Dec 1, 2014

I ran into a similar issue with punctuation last year and ended up solving it with a postprocessing step. I'm starting to think that it makes sense to have the generator emit marked-up XML or something and then run clean-up on it as a matter of course.
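A rough Python sketch of what "emit marked-up XML, then clean up" could look like. The element names (`w` for word, `pc` for punctuation, `q` for a quotation, `p` for a paragraph) are hypothetical and not taken from any tool mentioned in this thread. Because quotations are elements, they are balanced by construction and the flattener decides all the spacing.

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Flatten one element's children into text, choosing spacing by tag."""
    out = ""
    for child in elem:
        if child.tag == "w":        # word: preceded by a space unless first
            kind, text = "word", child.text or ""
        elif child.tag == "pc":     # punctuation: hugs whatever is on its left
            kind, text = "punct", child.text or ""
        elif child.tag == "q":      # quotation: wrap the flattened contents
            # \u201c and \u201d are the left and right curly double quotes
            kind, text = "word", "\u201c" + flatten(child) + "\u201d"
        else:
            continue                # unknown tags are skipped in this sketch
        if out and kind == "word":
            out += " "
        out += text
    return out


def flatten_document(xml_text):
    """Flatten every <p> element and separate paragraphs with a blank line."""
    root = ET.fromstring(xml_text)
    return "\n\n".join(flatten(p) for p in root.iter("p"))
```

For example, a paragraph holding a `q` element with the words "It's just nonsense" and a `!`, followed by "declared Dorothy" and a `.`, flattens to the quoted sentence with no stray spaces.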

@cpressey

cpressey commented Dec 2, 2014

Outputting some kind of tree structure (like XML) and then flattening it (sensibly) is a good approach.

On the other hand, this level of punctuation/spacing messiness is nothing a few rewriting rules can't clean up.

Given that this seems to be a "problem" that several participants have encountered, I'm working on generalizing the code I wrote into a proper reusable tool of some sort. (nice change to be doing engineering again after all that hackery science, too)

Here's what it does, so far, on an excerpt from The Quantum Supposition of Oz:

“Please tell Ozma, Dorothy, and when I visit Ozma she sometimes allows
me to ride upon his back, one seat for each member of the council. The”
H. M. “meant Highly Magnified, if you like,” said he.

“I dunno where this tunnel in the mountain he said to himself:

“Do,” said Nikobob, “said the stuffed one, seriously.

“I've forgotten, and I'm surprised that I was not a live thing; you're a
dummy.”

“It's just nonsense!” declared Dorothy.

(I love that last line :)

I don't know how long I'll spend on perfectionistically engineering this, but I'm hoping to end up with something like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) except for plain text.

If I'm happy with it before 11 more months have passed, I'll announce it on next year's Resources issue :)

@enkiv2

enkiv2 commented Dec 2, 2014

I've had good luck in the past treating punctuation as its own token, then
normalizing with

sed -E 's/ *([.,?!:;]) */\1 /g;s/ *([([]) *([A-Za-z0-9])/\1\2/g;s/([A-Za-z0-9]) *([])]) */\1\2/g'

-- in other words, left-aligning all the stops and the right-hand grouping
symbols and right-aligning the left-hand grouping symbols. Then, you need
another stage for handling quotes -- but without balancing, that's more of
a pain.
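The same idea can be sketched in Python. This follows the prose description (stops and closing brackets attach to the left, opening brackets attach to the right) rather than porting the sed expressions exactly. The quote stage is an added assumption: it only handles straight double quotes that strictly alternate between opening and closing, and it gives up when the count is odd, which is the balancing problem mentioned above. Note also that the stop rule will split decimals such as "3.14".

```python
import re


def normalize(text):
    """Attach punctuation and brackets to the neighboring words."""
    # Stops: drop spaces before, leave exactly one space after.
    text = re.sub(r" *([.,?!:;]) *", r"\1 ", text)
    # Opening brackets: drop spaces between the bracket and what follows.
    text = re.sub(r"([(\[]) +", r"\1", text)
    # Closing brackets: drop spaces between the bracket and what precedes it.
    text = re.sub(r" +([)\]])", r"\1", text)
    # The stop rule can leave a space at the end of a line; remove it.
    return re.sub(r" +$", "", text, flags=re.MULTILINE)


def attach_quotes(text):
    """Pull straight double quotes tight around the text they enclose.

    Assumes quotes alternate open/close; unbalanced input is returned as-is.
    """
    parts = text.split('"')
    if len(parts) % 2 == 0:  # odd number of quote marks: cannot pair them up
        return text
    out = []
    for i, part in enumerate(parts):
        # Odd-indexed parts sat between an opening and a closing quote.
        out.append('"' + part.strip() + '"' if i % 2 == 1 else part)
    return "".join(out)
```

Running `normalize` first and `attach_quotes` second keeps the two concerns separate, so the quote stage can be swapped for something smarter later.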

@MichaelPaulukonis

A different approach to Markov tokenization - I've worked with punctuation before in different ways, but for text blobs, so I never had to worry about the spacing. I appreciated the links to Racter/PBiHC, since I hadn't seen the template details before.

@spc476

spc476 commented Dec 2, 2014

You're welcome. It's surprising there's so little information about Racter out there (and according to Google, I appear to be one of the experts about Racter---sigh). The source to Racter is out there, but what is there appears to be the post-processed output from INRAC, a custom language used to write Racter. It's bizarre (http://boston.conman.org/2008/06/18.2).

@cpressey

cpressey commented Dec 2, 2014

That... is actually a pretty nifty control structure. "Find all labels that match this pattern, then pick one of those labels at random and call it."
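To make that control structure concrete, here is a small Python sketch. It is not INRAC syntax and makes no claim about how Racter implements it; the registry, the `label` decorator, the shell-style wildcard patterns, and the example label names are all hypothetical.

```python
import fnmatch
import random

ROUTINES = {}  # label name -> callable (hypothetical registry)


def label(name):
    """Decorator that registers a function under a label."""
    def register(func):
        ROUTINES[name] = func
        return func
    return register


@label("greet.formal")
def greet_formal():
    return "Good day to you."


@label("greet.casual")
def greet_casual():
    return "Hi there!"


@label("farewell")
def farewell():
    return "Goodbye."


def call_matching(pattern, rng=random):
    """Find all labels matching the pattern, pick one at random, and call it."""
    matches = sorted(n for n in ROUTINES if fnmatch.fnmatchcase(n, pattern))
    if not matches:
        raise LookupError(f"no label matches {pattern!r}")
    return ROUTINES[rng.choice(matches)]()
```

Calling `call_matching("greet.*")` runs one of the two greeting routines, chosen at random, so adding variety is just a matter of registering more labels that fit the pattern.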
