separate-columns with default target naming #78

Closed
genmeblog opened this issue Oct 9, 2022 · 16 comments

@genmeblog
Member

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/seperate.20with.20custom.20fn

@genmeblog
Member Author

This will be a (minor) breaking change. By default, the source column will be replaced by the new ones in every case.

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |    2 |    3 |    9 |   10 |   11 |   22 |   33 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y reverse))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |   33 |   22 |   11 |   10 |    9 |    3 |    2 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y (fn [input]
                             (zipmap "somenames" input))))
;; => _unnamed [1 7]:
;;    | :x |  a | s |  e |  m |  n | o |
;;    |---:|---:|--:|---:|---:|---:|--:|
;;    |  1 | 22 | 2 | 10 | 33 | 11 | 3 |

@behrica
Member

behrica commented Oct 22, 2022

I am now wondering whether this use case should be handled by "tc/separate-column" or whether it requires a completely new method, for performance reasons. The seq in your example, [2 3 9 10 11 22 33], could just as well be a double array, like this:

(def ds
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

Separating this (especially when it is large) could be optimized in this way:

(->
 (tech.v3.datatype/concat-buffers (:y ds))
 (tech.v3.tensor/reshape [(tc/row-count ds)
                          (-> ds :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

(plus replacing the column :y in the ds with the new ds)
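
A minimal sketch of that replacement step, assuming tablecloth's `tc/drop-columns` and `tc/append` are used to swap the array column for the reshaped columns. The function name `separate-array-column` is hypothetical, and the new columns keep whatever names `tensor->dataset` assigns:

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt]
         '[tech.v3.dataset.tensor :as ds-tensor])

;; Hypothetical helper, not part of tablecloth.
;; Assumes every row of `col` holds a sequence/array of the same length.
(defn separate-array-column
  [ds col]
  (let [n-rows (tc/row-count ds)
        n-cols (count (first (get ds col)))
        ;; same reshape approach as above: one flat buffer -> [rows cols] tensor
        wide   (-> (dtype/concat-buffers (get ds col))
                   (dtt/reshape [n-rows n-cols])
                   (ds-tensor/tensor->dataset))]
    ;; drop the source column and add the new columns next to the rest
    (-> ds
        (tc/drop-columns col)
        (tc/append wide))))
```

It could be called as `(separate-array-column ds :y)`.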

I suppose this is significantly faster than the generic "separate" implementation you have in tc/separate-column.
It works as well for the persistent vector case above.

Test cases could be these:

(def ds-1
  (-> (tc/dataset {:x [1 2] :y [[2 3 9 10 11 22 33]
                                [2 3 9 10 11 22 33]]})))

(def ds-2
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

(def ds-3
  (-> (tc/dataset {:x [1] :y [(list 2 3 9 10 11 22 33)]})))


(->
 (tech.v3.datatype/concat-buffers (:y ds-1))
 (tech.v3.tensor/reshape [(tc/row-count ds-1)
                          (-> ds-1 :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

@behrica
Member

behrica commented Oct 22, 2022

In some cases we even want to get the tensor back and not the dataset, so omit the last tensor->dataset call.

I think it is a useful addition to tablecloth; we often go from a dataset to a conceptual 2-d matrix
(but with the matrix rows stored inside a single dataset column).

@behrica
Member

behrica commented Oct 22, 2022

Not sure about the reverse,
i.e. starting from a dataset with several (numeric) columns and squeezing them into a single column of native arrays.

@behrica
Member

behrica commented Oct 22, 2022

For the reverse, something like this works, though I am not sure it is optimal:


(def ds
  ;; => _unnamed [3 2]:
  ;;    | :x-0 | :x-1 |
  ;;    |-----:|-----:|
  ;;    |    1 |    4 |
  ;;    |    2 |    5 |
  ;;    |    3 |    6 |
  (->
   (tc/dataset {:x-0 [1 2 3]
                :x-1 [4 5 6]})))
                


(def rows
  (->
   (tech.v3.datatype/concat-buffers (tc/columns ds))
   (tech.v3.tensor/reshape [(tc/column-count ds)
                            (tc/row-count ds)])
   (tech.v3.tensor/transpose [1 0])
   (tech.v3.tensor/rows)))

(tc/dataset {:x (map tech.v3.datatype/->double-array rows)})
;; => _unnamed [3 1]:
;;    |          :x |
;;    |-------------|
;;    | [D@1600011f |
;;    |  [D@fc74513 |
;;    | [D@20c51970 |

@behrica
Member

behrica commented Oct 22, 2022

I think a pair of functions to go from one representation to the other would be useful.

@genmeblog
Member Author

This looks like a very specific case, a kind of matrix transpose. I'm not sure it belongs in TC.

The last case (reverse) can be done with join-columns and {:result-type double-array}.

BTW, does tensor work on non-numerical data?

@genmeblog
Member Author

My original solution landed in 6.103

@behrica
Member

behrica commented Oct 24, 2022

Numeric only.
I think there should be 2 methods for this in TC; they operate on a Dataset.
It's a specific form of separate.

@behrica
Member

behrica commented Oct 24, 2022

Numeric only.
I think there should be 2 methods for this in TC; they operate on a Dataset.
It's a specific form of separate, and requires an array of the same type and length in each row.
I can do a PR, as I have a use case.
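
The constraint mentioned above could be checked up front with something like the following sketch. The name `uniform-rows?` is hypothetical, and "same type" is approximated here by comparing the class of each row container (for primitive arrays the class also fixes the element type; for vectors it does not):

```clojure
;; Hypothetical helper: true when all rows have the same length
;; and the same container class (e.g. all double arrays).
(defn uniform-rows?
  [rows]
  (let [rows (seq rows)]
    (or (nil? rows)
        (and (apply = (map count rows))
             (apply = (map class rows))))))

(uniform-rows? [(double-array [1 2]) (double-array [3 4])])
;; => true
(uniform-rows? [(double-array [1 2]) (double-array [3])])
;; => false
```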

@behrica
Member

behrica commented Oct 24, 2022

But indeed, this goes into numeric territory and into going from a dataset to a matrix.

@behrica
Member

behrica commented Oct 24, 2022

I will try it out forward and backward.
I have the impression, without proof, that my code above could be far more performant, though with some constraints.

I will measure it on a larger case.

@behrica
Member

behrica commented Oct 24, 2022

As I thought. On a 1000 x 1000 matrix-like dataset of doubles:

(def ds (api/dataset {:x (map 
                          (fn [_] (double-array (range 1000)))
                          (range 1000))}))

we get a factor of 50 to 100 difference in execution time

(defn use-separate []
 (api/separate-column ds :x))

(defn use-reshape []
 (->
  (tech.v3.datatype/concat-buffers (:x ds))
  (tech.v3.tensor/reshape [(api/row-count ds)
                           (-> ds :x first count)])
  (tech.v3.dataset.tensor/tensor->dataset)))


(time (def _ (use-separate)))
;; "Elapsed time: 3371.491881 msecs"
(time (def _ (use-reshape)))
;; "Elapsed time: 76.420533 msecs"

for producing the same dataset.

@behrica
Member

behrica commented Oct 24, 2022

The reverse is less of a difference, but still a factor of 5:

(def ds-with-cols (use-reshape))

(time
 (def _  (api/join-columns ds-with-cols :x (api/column-names ds-with-cols) {:result-type double-array})))
;; "Elapsed time: 333.478279 msecs"

(time
 (let [rows
       (->
        (tech.v3.datatype/concat-buffers (api/columns ds-with-cols))
        (tech.v3.tensor/reshape [(api/column-count ds-with-cols)
                                 (api/row-count ds-with-cols)])
        (tech.v3.tensor/transpose [1 0])
        (tech.v3.tensor/rows))]
   (api/dataset {:x (map tech.v3.datatype/->double-array rows)})))
;; "Elapsed time: 66.384538 msecs"

@behrica
Member

behrica commented Oct 24, 2022

But I was wrong above: the code works with non-numeric data as well.

@genmeblog
Member Author

genmeblog commented Oct 24, 2022

Yes, join-columns and separate-column are slow. I know that. These two functions are more general than just packing/unpacking a sequence to/from column(s).
join-columns and separate-column are more or less the same as tidyr's extract, separate and unite functions.

Your example is just one special case, which can certainly be optimized. If you have an idea for a PR, it's always welcome.
