Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add no_hardlinks option to LocalConfig and fix error handling #642

Closed
wants to merge 8 commits into from

Conversation

mrchypark
Copy link
Contributor

@mrchypark mrchypark commented Aug 29, 2024

What

Add no_hardlinks option to LocalConfig for object store operations. Try to resolve this issue.

How

  • Added no_hardlinks boolean field to LocalConfig struct
  • Updated from_hashmap and to_hashmap methods to handle the new field
  • Modified copy_if_not_exists and rename_if_not_exists methods in InternalObjectStore to use no_hardlinks option
  • Added necessary imports in relevant files

Why

  • This change allows users to disable hardlink creation in local object store operations
  • Improves flexibility for systems where hardlinks are not supported or desired
  • Enables more efficient file operations in certain scenarios, potentially improving performance
  • Addresses compatibility issues with network drives and certain CSI (Container Storage Interface) implementations
    • Some network drives and CSI providers (e.g., Azure Blob CSI) do not support hardlinks
    • This option ensures smooth operation in environments where hardlink creation is not possible or not allowed

Tests

  • Added new unit tests to verify the behavior of no_hardlinks option
  • Updated existing tests to cover the new functionality
  • All existing tests are passing with the new changes

@mrchypark mrchypark marked this pull request as ready for review August 29, 2024 09:58
object_store_factory/src/local.rs Outdated Show resolved Hide resolved
@mrchypark
Copy link
Contributor Author

I've reviewed the changes in apache/arrow-rs#5094 and made some additional modifications.

I'm pleased to report that we've made significant progress. The CREATE TABLE, INSERT, and ALTER TABLE RENAME TO operations are now working without errors.

Given that this functionality is designed for file systems without hardlinks, which presents some testing challenges, I've conducted manual experiments after building the project. The results have been positive so far.

Here's an example of the SQL operations I've successfully tested:

CREATE TABLE test (application_ID TEXT);
SELECT * FROM test;
INSERT INTO test (application_ID)
VALUES 
    ('APP2024-001'),
    ('APP2024-002'),
    ('APP2024-003'),
    ('APP2024-004'),
    ('APP2024-005'),
    ('APP2024-006'),
    ('APP2024-007'),
    ('APP2024-008'),
    ('APP2024-009'),
    ('APP2024-010');
SELECT * FROM test;
ALTER TABLE test RENAME TO testy;
SELECT * FROM testy;

All of these operations executed without any errors, which is encouraging.

I'd appreciate any suggestions for other common query operations that we should verify. Your expertise would be valuable in ensuring we've covered all the essential scenarios.

Thank you for your ongoing support and collaboration on this project!

@mildbyte
Copy link
Contributor

mildbyte commented Sep 2, 2024

@mrchypark Looks good, thanks! Do you mind fixing the formatting / Clippy errors in CI (it's failing right now: https://github.com/mrchypark/seafowl/actions/runs/10667722628/job/29565871004)? I'm trying to see how to enable workflows to run inside of this PR but can't, I can only see CI runs on your original repo.

@gruuya
Copy link
Contributor

gruuya commented Sep 3, 2024

I'm trying to see how to enable workflows to run inside of this PR but can't, I can only see CI runs on your original repo.

I think we need to extend

on:
push:
to be

on: [push, pull_request]

We also probably need some settings tweaks along these lines: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository#controlling-changes-from-forks-to-workflows-in-public-repositories

@mildbyte
Copy link
Contributor

mildbyte commented Sep 3, 2024

@gruuya got it, thanks!

@mrchypark if you rebase on our main branch to pick up https://github.com/splitgraph/seafowl/pull/649/files, we should be able to run CI on this PR right here

@mildbyte
Copy link
Contributor

mildbyte commented Sep 3, 2024

@mrchypark I checked your branch locally and fixed Clippy/formatting errors, if you push this patch to your branch, CI should pass and we'll merge:

commit 45c9215f38d91491c42e6aaf688c4985d6ed9802
Author: Artjoms Iškovs <mildbyte@gmail.com>
Date:   Tue Sep 3 10:47:50 2024 +0100

    Fix formatting and Clippy

diff --git a/object_store_factory/src/local.rs b/object_store_factory/src/local.rs
index 9d107a4..c494b2b 100644
--- a/object_store_factory/src/local.rs
+++ b/object_store_factory/src/local.rs
@@ -19,20 +19,27 @@ impl LocalConfig {
         map: &HashMap<String, String>,
     ) -> Result<Self, object_store::Error> {
         Ok(Self {
-          data_dir: map.get("data_dir")
-          .ok_or_else(|| object_store::Error::Generic {
-              store: "local",
-              source: "Missing data_dir".into(),
-          })?
-          .clone(),
-            disable_hardlinks: map.get("disable_hardlinks").map(|s| s == "true").unwrap_or(false),
+            data_dir: map
+                .get("data_dir")
+                .ok_or_else(|| object_store::Error::Generic {
+                    store: "local",
+                    source: "Missing data_dir".into(),
+                })?
+                .clone(),
+            disable_hardlinks: map
+                .get("disable_hardlinks")
+                .map(|s| s == "true")
+                .unwrap_or(false),
         })
     }
 
     pub fn to_hashmap(&self) -> HashMap<String, String> {
         let mut map = HashMap::new();
         map.insert("data_dir".to_string(), self.data_dir.clone());
-        map.insert("disable_hardlinks".to_string(), self.disable_hardlinks.to_string());
+        map.insert(
+            "disable_hardlinks".to_string(),
+            self.disable_hardlinks.to_string(),
+        );
         map
     }
 
@@ -57,7 +64,7 @@ mod tests {
         let config = LocalConfig::from_hashmap(&map)
             .expect("Failed to create config from hashmap");
         assert_eq!(config.data_dir, "/tmp/data".to_string());
-        assert_eq!(config.disable_hardlinks, false); // Default value
+        assert!(!config.disable_hardlinks); // Default value
     }
 
     #[test]
@@ -69,7 +76,7 @@ mod tests {
         let config = LocalConfig::from_hashmap(&map)
             .expect("Failed to create config from hashmap");
         assert_eq!(config.data_dir, "/tmp/data".to_string());
-        assert_eq!(config.disable_hardlinks, true);
+        assert!(config.disable_hardlinks);
     }
 
     #[test]
@@ -81,7 +88,7 @@ mod tests {
         let config = LocalConfig::from_hashmap(&map)
             .expect("Failed to create config from hashmap");
         assert_eq!(config.data_dir, "/tmp/data".to_string());
-        assert_eq!(config.disable_hardlinks, false);
+        assert!(!config.disable_hardlinks);
     }
 
     #[test]
@@ -133,7 +140,7 @@ mod tests {
 
     #[test]
     fn test_default_false() {
-        assert_eq!(default_false(), false);
+        assert!(!default_false());
     }
 
     #[test]
@@ -146,7 +153,7 @@ mod tests {
 
         let config: LocalConfig = serde_json::from_str(json).unwrap();
         assert_eq!(config.data_dir, "/tmp/data");
-        assert_eq!(config.disable_hardlinks, false);
+        assert!(!config.disable_hardlinks);
     }
 
     #[test]
@@ -160,6 +167,6 @@ mod tests {
 
         let config: LocalConfig = serde_json::from_str(json).unwrap();
         assert_eq!(config.data_dir, "/tmp/data");
-        assert_eq!(config.disable_hardlinks, true);
+        assert!(config.disable_hardlinks);
     }
-}
\ No newline at end of file
+}
diff --git a/src/main.rs b/src/main.rs
index 46eab5f..1caf26e 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -29,7 +29,7 @@ use seafowl::{
 
 use tokio::time::{interval, Duration};
 use tracing::level_filters::LevelFilter;
-use tracing::{error, info, debug, subscriber, warn};
+use tracing::{debug, error, info, subscriber, warn};
 use tracing_log::LogTracer;
 use tracing_subscriber::filter::EnvFilter;
 
diff --git a/src/object_store/wrapped.rs b/src/object_store/wrapped.rs
index ad4446d..a17b10c 100644
--- a/src/object_store/wrapped.rs
+++ b/src/object_store/wrapped.rs
@@ -16,8 +16,8 @@ use url::Url;
 
 use object_store_factory::aws::S3Config;
 use object_store_factory::google::GCSConfig;
-use object_store_factory::ObjectStoreConfig;
 use object_store_factory::local::LocalConfig;
+use object_store_factory::ObjectStoreConfig;
 
 // Wrapper around the object_store crate that holds on to the original config
 // in order to provide a more efficient "upload" for the local object store
@@ -152,8 +152,22 @@ impl ObjectStore for InternalObjectStore {
         payload: PutPayload,
         opts: PutOptions,
     ) -> Result<PutResult> {
-        if let  ObjectStoreConfig::Local(LocalConfig { disable_hardlinks: true, .. }) = self.config {
-          return self.inner.put_opts(location, payload, PutOptions{mode: object_store::PutMode::Overwrite, ..opts}).await
+        if let ObjectStoreConfig::Local(LocalConfig {
+            disable_hardlinks: true,
+            ..
+        }) = self.config
+        {
+            return self
+                .inner
+                .put_opts(
+                    location,
+                    payload,
+                    PutOptions {
+                        mode: object_store::PutMode::Overwrite,
+                        ..opts
+                    },
+                )
+                .await;
         };
         self.inner.put_opts(location, payload, opts).await
     }
@@ -243,7 +257,11 @@ impl ObjectStore for InternalObjectStore {
     ///
     /// Will return an error if the destination already has an object.
     async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {
-        if let  ObjectStoreConfig::Local(LocalConfig { disable_hardlinks: true, .. }) = self.config {
+        if let ObjectStoreConfig::Local(LocalConfig {
+            disable_hardlinks: true,
+            ..
+        }) = self.config
+        {
             return self.inner.copy(from, to).await;
         }
         self.inner.copy_if_not_exists(from, to).await
@@ -261,7 +279,11 @@ impl ObjectStore for InternalObjectStore {
             // this with a lock too, so look into using that down the line instead.
             return self.inner.rename(from, to).await;
         }
-        if let ObjectStoreConfig::Local(LocalConfig { disable_hardlinks: true, .. }) = self.config {
+        if let ObjectStoreConfig::Local(LocalConfig {
+            disable_hardlinks: true,
+            ..
+        }) = self.config
+        {
             return self.inner.rename(from, to).await;
         }
         self.inner.rename_if_not_exists(from, to).await
@@ -274,7 +296,7 @@ mod tests {
     use crate::object_store::wrapped::InternalObjectStore;
     use datafusion::common::Result;
     use rstest::rstest;
-    
+
     use object_store_factory::aws::S3Config;
     use object_store_factory::ObjectStoreConfig;

@mildbyte
Copy link
Contributor

mildbyte commented Sep 3, 2024

OK @mrchypark I'll fix the build on a separate branch in our repo and raise it as a separate PR, give me a few minutes

@mrchypark
Copy link
Contributor Author

What am I missing?

@mildbyte
Copy link
Contributor

mildbyte commented Sep 3, 2024

All good now @mrchypark, I fixed the tests in #650 and we'll merge that PR instead, I'll close this PR. Thanks a lot for the contribution!

@mildbyte mildbyte closed this Sep 3, 2024
@gruuya
Copy link
Contributor

gruuya commented Sep 3, 2024

@mrchypark for future needs you can also do pre-commit install in the seafowl repo, making the lint run on git commit (typically saves you from additional ci runs/pushes to fix those).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Azure Blob CSI Driver Compatibility Issue with Seafowl
4 participants