Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Voyager support in Scio #4996

Merged
merged 28 commits into from
Oct 10, 2023
Merged

Voyager support in Scio #4996

merged 28 commits into from
Oct 10, 2023

Conversation

patrickwmcgee
Copy link
Contributor

@patrickwmcgee patrickwmcgee commented Sep 12, 2023

This adds in support for Voyager to Scio, the api is modeled after the existing Annoy implementation since they're both Nearest Neighbor lookups.

The integration tests are still a work in progress as I couldn't get them to run locally and will iterate on them here (since CI should be able to properly run them with permissions) but the rest of the code should have basic unit tests and I've done integration style tests in dataflow to

  1. write an index
  2. read an index and query it and
  3. write a temporary index and query it.

@codecov
Copy link

codecov bot commented Sep 13, 2023

Codecov Report

Merging #4996 (7902583) into main (9eb0d59) will increase coverage by 0.03%.
Report is 12 commits behind head on main.
The diff coverage is 67.46%.

❗ Current head 7902583 differs from pull request most recent head 8a4d84b. Consider uploading reports for the commit 8a4d84b to get more accurate results

@@            Coverage Diff             @@
##             main    #4996      +/-   ##
==========================================
+ Coverage   63.12%   63.15%   +0.03%     
==========================================
  Files         283      286       +3     
  Lines       10659    10739      +80     
  Branches      765      767       +2     
==========================================
+ Hits         6728     6782      +54     
- Misses       3931     3957      +26     
Files Coverage Δ
...ify/scio/io/dynamic/syntax/SCollectionSyntax.scala 90.38% <ø> (ø)
...cala/com/spotify/scio/grpc/SCollectionSyntax.scala 100.00% <100.00%> (ø)
...sdk/extensions/smb/ParquetTypeFileOperations.scala 96.66% <100.00%> (+0.11%) ⬆️
.../scio/extra/voyager/syntax/ScioContextSyntax.scala 0.00% <0.00%> (ø)
...scala/com/spotify/scio/extra/voyager/Voyager.scala 66.66% <66.66%> (ø)
.../scio/extra/voyager/syntax/SCollectionSyntax.scala 50.00% <50.00%> (ø)

... and 1 file with indirect coverage changes

@RustedBones RustedBones marked this pull request as draft September 13, 2023 10:14
@RustedBones
Copy link
Contributor

Thanks @patrickwmcgee . I just converted the PR to draft until this gets ready. Will have a look at it ASAP

build.sbt Outdated Show resolved Hide resolved
* the `index.hnsw` and `names.json` are.
*/
trait VoyagerUri extends Serializable {
val logger = LoggerFactory.getLogger(this.getClass)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer logger declaration in the companion object, with @transient lazy val to avoid serialization

*/
trait VoyagerUri extends Serializable {
val logger = LoggerFactory.getLogger(this.getClass)
val path: String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer def in traits

}

def files: Seq[String] = Seq("index.hnsw", "names.json")
implicit val voyagerUriCoder: Coder[VoyagerUri] = Coder.kryo[VoyagerUri]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could probably do better than kryo if VoyagerUri is a sealed trait and implementations case classes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if you could follow up on this with more of an example?

Copy link
Contributor

@clairemcginty clairemcginty Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! basically, Scio can leverage Magnolia to automatically derive compile-time Coders for Scala ADTs: case classes, and sealed traits (sealed is important so that Magnolia knows in advance every possibility it can expect when it decodes an instance of the trait). These types of coders are much more efficient than Kryo, which basically traverses the class fields at runtime.

So, you could have a setup like this:

sealed trait VoyagerUri {
   def path: String
   ...
}

case class LocalVoyagerUri(path: String) extends VoyagerUri {
   override def....
}

case class RemoteVoyagerUri(path: String, private rfu: RemoteFileUtil) extends VoyagerUri {
   override def ...
}

object RemoteVoyagerUri {
  def apply(path: String, options: PipelineOptions): RemoteVoyagerUri =
   RemoteVoyagerUri(path, RemoteFileUtil.create(options))
}

Then you can remove the implicit Kryo coder, and it should just work ™️ .

We even have a specialized type of unit test assertion for checking that your Coder doesn't fall back to kryo:

import com.spotify.scio.testing.CoderAssertions._

class VoyagerTest extends PipelineSpec {
   "VoyagerUri" should "not use Kryo" in {
      val localUri: VoyagerUri = LocalVoyagerUri("blah")
      localUri coderShould notFallback()

      val remoteUri: VoyagerUri = RemoteVoyagerUri("blah",  PipelineOptionsFactory.create())
      remoteUri coderShould notFallback()
   }
}

(more examples in CoderTest).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored this so we do not have ADT. VoyagerUri is a simple value-class around a normal URI. the RemoteFileUtil can be passed as implicit paramer to exist

The other functions should not de defined into the model. Moved them to the side-input or the dedicated asVoyager SCOllection operation

@RustedBones
Copy link
Contributor

You need to run headerCreateAll to make the CI happy :)

@patrickwmcgee patrickwmcgee marked this pull request as ready for review September 26, 2023 20:13
}

override private[voyager] def saveAndClose(w: VoyagerWriter): Unit = {
val tempPath: Path = Files.createTempDirectory("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val tempPath: Path = Files.createTempDirectory("")
val tempPath: Path = Files.createTempDirectory("voyager-")


override private[voyager] def saveAndClose(w: VoyagerWriter): Unit = {
val tempPath: Path = Files.createTempDirectory("")
logger.info(s"temp path: $tempPath")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.info(s"temp path: $tempPath")

I would move this to line 106, so we can log it as "Uploaded Voyager file from $tempPath to $path/$f" 👍

): SCollection[VoyagerUri] = {
val uuid: UUID = UUID.randomUUID()
val tempLocation: String = self.context.options.getTempLocation
require(tempLocation != null, s"--tempLocation arg is required")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require(tempLocation != null, s"--tempLocation arg is required")
require(tempLocation != null, s"Voyager writes require --tempLocation to be set.")

) extends SideInput[VoyagerReader] {
override def get[I, O](context: DoFn[I, O]#ProcessContext): VoyagerReader = {
val uri = context.sideInput(view)
VOYAGER_URI_MAP.synchronized {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain the synchronized map use case a bit? I guess it would be if a user tried to load two Voyager SideInputs w/ the same URI at different nodes of the job graph... is loading the reader a really expensive operation itself, like does it load everything into memory at construction time, or is it just a pointer to a URI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the loading is really expensive and having multiple nodes / threads try to create a reader concurrently will cause the job to OOM. This sort of alleviates that and lets it be a singleton of uri -> reader at least for constructing and loading it into memory.

@clairemcginty clairemcginty added this to the 0.13.4 milestone Sep 26, 2023
patrickwmcgee and others added 2 commits September 27, 2023 11:01
…ge.scala

Co-authored-by: Claire McGinty <claire.d.mcginty@gmail.com>
@RustedBones
Copy link
Contributor

Took the freedom to refactor the PR to avoid making the same design mistakes which were done in the Annoy module

@RustedBones RustedBones merged commit b3215c6 into spotify:main Oct 10, 2023
11 checks passed
@patrickwmcgee patrickwmcgee deleted the voyager branch October 11, 2023 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants