
Feature/geopackage reader #1603

Open

wants to merge 1 commit into base: master
Conversation

Imbruced
Member

@Imbruced Imbruced commented Sep 24, 2024

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

  • No

What changes were proposed in this PR?

Geopackage datasource

How was this patch tested?

integration tests

Did this PR include necessary documentation updates?

  • I'll adjust the docs

@Imbruced
Member Author

As a follow-up to this MR, I need to add:

  • EWKB parsing (I think reading circular geometries won't work)
  • support for parallel loading via limit and offset
  • some form of filtering based on the envelope offset
  • direct transformation from GeoPackage serialization to Sedona
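The LIMIT/OFFSET idea in the list above could be planned roughly like this; a minimal, standalone sketch (the object name and its inputs are hypothetical, not code from this PR — `table` and `rowCount` would come from `gpkg_contents` or a `COUNT(*)` query):

```scala
// Hypothetical sketch: split a GeoPackage feature table into one
// LIMIT/OFFSET query per Spark partition.
object OffsetPartitioner {
  def partitionQueries(table: String, rowCount: Long, numPartitions: Int): Seq[String] = {
    require(numPartitions > 0, "need at least one partition")
    // rows per partition, rounded up so the offsets cover every row
    val step = math.max(1L, math.ceil(rowCount.toDouble / numPartitions).toLong)
    (0L until rowCount by step).map { offset =>
      s"SELECT * FROM $table LIMIT $step OFFSET $offset"
    }
  }
}
```

Each query could then run on its own sqlite-jdbc connection inside a partition reader. One caveat: SQLite implements OFFSET by scanning and discarding the skipped rows, so rowid-range predicates may scale better for large tables.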

@Kontinuation
Member

Kontinuation commented Sep 25, 2024

Thank you for this great work! My major concern is whether it works with GeoPackage files stored on cloud storage such as HDFS or S3. As far as I know org.xerial:sqlite-jdbc only works with local files, so we may have to download the GeoPackage file from cloud storage to the local file system before reading it.

@Imbruced
Member Author

@Kontinuation good point, I'll write a test to make sure it works. An integration test with MinIO would be enough, I think.

@Imbruced
Member Author

WIP

@github-actions github-actions bot added the docs label Sep 30, 2024
@Imbruced Imbruced marked this pull request as ready for review October 1, 2024 21:15
Comment on lines 49 to 60
// skip srid for now
reader.getInt()

skipEnvelope(resolvedFlags._1, reader)

val wkb = new Array[Byte](reader.remaining())
reader.get(wkb)

val wkbReader = new org.locationtech.jts.io.WKBReader()
val geom = wkbReader.read(wkb)
Member

I suggest that we take SRID into consideration. val wkbReader = new WKBReader(new GeometryFactory(new PrecisionModel(), srid)) would be sufficient.
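For reference, the SRID in question sits in the GeoPackage binary geometry header, right after the magic and flag bytes. A standalone sketch of parsing that header per the GeoPackage spec (the names here are illustrative, not code from this PR):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// envelopeDoubles tells the caller how many 8-byte doubles to skip
// before the WKB payload begins.
case class GpkgHeader(version: Int, flags: Int, srid: Int, envelopeDoubles: Int)

def parseGpkgHeader(buf: ByteBuffer): GpkgHeader = {
  // magic bytes "GP"
  require(buf.get() == 'G'.toByte && buf.get() == 'P'.toByte, "not a GeoPackage geometry blob")
  val version = buf.get() & 0xFF
  val flags = buf.get() & 0xFF
  // flags bit 0 selects the byte order of the SRID and envelope: 1 = little-endian
  buf.order(if ((flags & 1) == 1) ByteOrder.LITTLE_ENDIAN else ByteOrder.BIG_ENDIAN)
  val srid = buf.getInt()
  // flags bits 1-3: envelope contents indicator (0 = none, 1 = xy,
  // 2 = xyz, 3 = xym, 4 = xyzm)
  val envelopeDoubles = ((flags >> 1) & 7) match {
    case 0 => 0
    case 1 => 4
    case 2 | 3 => 6
    case 4 => 8
    case other => throw new IllegalArgumentException(s"invalid envelope indicator $other")
  }
  GpkgHeader(version, flags, srid, envelopeDoubles)
}
```

With the SRID kept instead of discarded, it can be fed into the `WKBReader` construction suggested above.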

Comment on lines 83 to 92
if (pathString.toLowerCase(Locale.ROOT).endsWith(".geopackage")) {
val path = new Path(pathString)
val fs = path.getFileSystem(hadoopConf)

val isDirectory = Try(fs.getFileStatus(path).isDirectory).getOrElse(false)
if (isDirectory) {
pathString
} else {
pathString.substring(0, pathString.length - 3) + "???"
}
Member

If I understand it correctly, if pathString ends with ".geopackage", and it is not a directory, it will be transformed to "****.geopack???". I cannot grasp the idea of this transformation.

Member Author

Yeah, you are right. I was sure that at this point I already had the list of files in the directory. I am wondering how it would behave if I pass a list of files, and whether this is actually needed (a user specifying paths with different file formats).

val serializableConf = new SerializableConfiguration(
  sparkSession.sessionState.newHadoopConfWithOptions(options.asScala.toMap))

val tempFile = FileSystemUtils.copyToLocal(serializableConf.value, files.head.getPath)
Member

Can we detect if the path is a local path and skip calling copyToLocal?

Member Author

yeah, sure
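The locality check discussed above could look like this; a small sketch, assuming paths arrive as URI strings (the helper name is made up, not from this PR):

```scala
import java.net.URI

// Hypothetical helper: only remote schemes (s3a, hdfs, ...) need a
// copyToLocal step; sqlite-jdbc can open local files directly.
def needsLocalCopy(path: String): Boolean = {
  // a path with no scheme (e.g. "/tmp/data.gpkg") is treated as local
  val scheme = Option(new URI(path).getScheme).getOrElse("file")
  scheme != "file"
}
```

A production version would likely compare against the Hadoop `FileSystem` resolved for the path rather than raw scheme strings, but the idea is the same.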

@Imbruced
Member Author

Imbruced commented Oct 2, 2024

@Kontinuation thanks for the review!

@@ -98,6 +97,36 @@
<groupId>org.locationtech.jts</groupId>
<artifactId>jts-core</artifactId>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
Member

Is there a reason why these dependencies only appear in the spark-3.3 profile?

Member Author

The test for loading from S3 (I used MinIO with Testcontainers) is only in Spark 3.3. I can duplicate it for other versions of Spark as well.

Member

Please move these to the spark-common pom.xml so all the Spark 3.x pom.xml files will share them

Member

Do we support other Spark versions 3.0, 3.1, and 3.2?

Member Author

I didn't test it with those; I can add the data source to those versions as well

Member

Yes, if the implementation differs among the versions, we should definitely replicate the tests and test them for each version.
