Almost-but-not-quite #10

MichaelPaulukonis · 2016-10-24T12:52:17Z

My main project will be to complete an npm module for getting texts that are almost-but-not-quite the same as the source text.

The idea is roughly the same as @dariusk's Harpooners and Sailors (here (source) and here (output+notes)) from last year - but wrapped up into a nice reusable package.

I think I would like to use such a module for other projects, so this is a good time to git-r-done.

Plus, I've been holding off the implementation of it until November, anyway.

MichaelPaulukonis · 2016-10-27T13:41:53Z

Link Dump

https://github.com/sindresorhus/leven
https://github.com/NaturalNode/natural#string-distance I worked with Natural, although one of my latest non-browser projects is using nlp_compromise (NLP compromised to make it small and fast enough for the browser) for reasons I can't remember.

MichaelPaulukonis · 2016-11-07T18:29:43Z

start of crude proof-of-concept code here.

Includes some not-quite-as-crude code from another project I've done.

Which uses the nlp-compromise package, instead of natural. I'm going to look into swapping those out.

MichaelPaulukonis · 2016-11-08T16:15:39Z

Sooooooo.... the light dawns on Marblehead: I'm using Levenshtein (edit-distance), whereas Kazemi used Word2Vec - which gives a semantic distance. Edit-distance is purely an accident of orthography.
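
To make the orthography point concrete, here is a minimal sketch of Levenshtein distance in plain JavaScript. It is a textbook dynamic-programming version, not the code from leven or natural, and `nearestButNotSame` is a hypothetical helper name invented for this illustration.

```javascript
// Textbook Levenshtein distance: the number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
function levenshtein(a, b) {
  // prev[j] is the distance between the processed prefix of a and b.slice(0, j).
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: the closest candidate that is not identical to the source.
function nearestButNotSame(source, candidates) {
  let best = null;
  let bestDistance = Infinity;
  for (const candidate of candidates) {
    const d = levenshtein(source, candidate);
    if (d > 0 && d < bestDistance) {
      best = candidate;
      bestDistance = d;
    }
  }
  return { sentence: best, distance: bestDistance };
}

console.log(levenshtein('cat', 'cut')); // 1
console.log(levenshtein('cat', 'feline')); // 6
```

"cat" and "cut" come out as near neighbours while "cat" and "feline" are as far apart as two strings of those lengths can be, which is the opposite of what a semantic distance would say.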

So, what I've got is not nearly as interesting as I was hoping for (as usual).

It is of some interest, and I'll post some examples later this week (I'm desperately short on time this year, le sigh).

enkiv2 · 2016-11-08T18:03:47Z

If you could normalize both to a scale between 0 and 1 you could multiply
them :)


MichaelPaulukonis · 2016-11-09T21:02:58Z

I think I'm going to do some overkill and play with retext and the nodes of its natural language concrete syntax tree (nlcst), which has some charms, such as paragraph and sentence tokenization and the ability to recreate the original text.

I find the online examples of using retext and nlcst to be sub-optimal.

Also, I'm curious why the project works asynchronously, when there are no asynchronous sub-elements.

MichaelPaulukonis · 2016-11-11T16:17:56Z

@enkiv2 - What would that do? Pretend I'm almost statistically innumerate....

There are libs that provide a 0..1 edit distance; I happened to pick a package that didn't.
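
For reference, one common way to get a 0..1 figure out of a raw edit distance is to divide by the length of the longer string. The sketch below assumes that convention; other libraries may normalize differently, and `normalizedSimilarity` is a name made up for this example.

```javascript
// Turn a raw edit distance into a similarity between 0 and 1.
// 1 means the strings are identical; 0 means every position had to change.
// Assumes `distance` was computed by some Levenshtein implementation.
function normalizedSimilarity(distance, a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - distance / longest;
}

// "kitten" -> "sitting" has a Levenshtein distance of 3,
// and the longer string has 7 characters, so this is 1 - 3/7 (about 0.571).
console.log(normalizedSimilarity(3, 'kitten', 'sitting'));
```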

We've got a baby coming in < 3 weeks, so I'm not going to get into too much craziness. Figuring out how to get retext going seems to be the high-point of the month for me.

enkiv2 · 2016-11-11T17:18:24Z

If you had the two factors scaled the same way, and multiplied them, you
would rank words that are a good match on both factors much higher than one
that is a good match on one but a poor match on the other. So, you'd get a
lot of heavily related words. The results might be much more interesting,
or much less interesting; I'm not sure.
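
A rough sketch of that multiplication, assuming both scores are already similarities scaled to 0..1. The candidate scores are invented placeholders, and `rankByProduct` is a hypothetical name.

```javascript
// All scores here are made-up placeholders, purely for illustration.
const candidates = [
  { text: 'good on both', editSimilarity: 0.9, semanticSimilarity: 0.8 },
  { text: 'good on spelling only', editSimilarity: 0.95, semanticSimilarity: 0.1 },
  { text: 'good on meaning only', editSimilarity: 0.2, semanticSimilarity: 0.9 },
];

// Multiply the two 0..1 scores and sort from the highest product to the lowest.
function rankByProduct(items) {
  return items
    .map((item) => ({
      ...item,
      combined: item.editSimilarity * item.semanticSimilarity,
    }))
    .sort((x, y) => y.combined - x.combined);
}

const ranked = rankByProduct(candidates);
console.log(ranked.map((item) => item.text));
```

Because a product is dragged down by its weaker factor, the candidate with the highest single score (0.95 on spelling) ends up last, and the one that is decent on both ends up first.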


MichaelPaulukonis · 2016-11-18T15:51:34Z

@enkiv2 we're ranking sentences, not words. I'm still not clear on what I would multiply.

Here is some sample output: https://gist.github.com/MichaelPaulukonis/2b2d47a5e22066e950c39841b9a6c889

It only took 11 hours, but that's also because the computer slept for much of that time.

enkiv2 · 2016-11-18T16:32:25Z

I guess if we're ranking sentences that's a much harder problem. I don't
know how to get, say, a word2vec-style location in semantic space for a
whole sentence. Adding all the vectors would probably produce some
unrelated word, if anything.
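
For what it's worth, the crude baseline here would be to average the word vectors of a sentence and compare sentences by cosine similarity. The sketch below uses invented 3-dimensional toy vectors rather than real word2vec output, and the function names are hypothetical. Whether the averaged vector means anything for real sentences is exactly the open question.

```javascript
// Toy 3-dimensional "word vectors": invented values, NOT real word2vec output.
const toyVectors = {
  whale: [0.9, 0.1, 0.0],
  ship: [0.7, 0.3, 0.1],
  sea: [0.8, 0.2, 0.1],
  tea: [0.0, 0.9, 0.3],
  cup: [0.1, 0.8, 0.4],
};

// Average the vectors of every known word; unknown words are skipped.
// Returns null when the sentence contains no known words.
function sentenceVector(sentence, vectors) {
  const words = sentence.toLowerCase().match(/[a-z']+/g) || [];
  const known = words.filter((w) =>
    Object.prototype.hasOwnProperty.call(vectors, w)
  );
  if (known.length === 0) return null;
  const sum = new Array(vectors[known[0]].length).fill(0);
  for (const w of known) {
    vectors[w].forEach((value, i) => {
      sum[i] += value;
    });
  }
  return sum.map((value) => value / known.length);
}

// Cosine of the angle between two vectors of equal length.
function cosineSimilarity(u, v) {
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  if (normU === 0 || normV === 0) return 0;
  return dot / (Math.sqrt(normU) * Math.sqrt(normV));
}

const whaling = sentenceVector('The whale and the ship', toyVectors);
const sailing = sentenceVector('A ship at sea', toyVectors);
const teatime = sentenceVector('A cup of tea', toyVectors);
console.log(cosineSimilarity(whaling, sailing) > cosineSimilarity(whaling, teatime));
```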


ikarth · 2016-11-18T17:49:32Z

There's been some work with vectors at the sentence, paragraph, and document level. Look into doc2vec.

MichaelPaulukonis · 2016-11-18T18:12:53Z

Kazemi's project last year used word2vec - which I missed when I started the project. I was trying to do a single-language (NodeJS) solution. Not quite possible.

michelleful · 2016-11-20T06:05:50Z

@enkiv2, you may want to give skip-thought vectors a try.

MichaelPaulukonis · 2016-11-29T21:50:28Z

@ikarth part of this was NOT using doc2vec since that's not NodeJS. Another part was thinking that Kazemi had not used it, either.

Something I did discover is some word-vectors as JSON - https://igliu.com/word2vec-json/

I'm going to call it quits for the month. I didn't hit my objective of a nicely packaged npm module, but I did generate a novel and learned new things.

We've got another baby due on Dec 1, so I'm going to finish off the month focusing on that!

The entire novel has been appended to gist @ https://gist.github.com/MichaelPaulukonis/2b2d47a5e22066e950c39841b9a6c889

MichaelPaulukonis changed the title ~~Close-but-not-quite~~ Almost-but-not-quite Nov 9, 2016

MichaelPaulukonis added the preview label Nov 18, 2016

hugovk mentioned this issue Nov 29, 2016

Language survey 2016 #51

Open

hugovk added the completed label Nov 30, 2016
