separate-columns with default target naming #78

genmeblog · 2022-10-09T22:01:26Z

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/seperate.20with.20custom.20fn

genmeblog · 2022-10-10T10:44:20Z

This will be a (minor) breaking change. By default, the source column will be replaced by the new ones in every case.

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |    2 |    3 |    9 |   10 |   11 |   22 |   33 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y reverse))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |   33 |   22 |   11 |   10 |    9 |    3 |    2 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y (fn [input]
                             (zipmap "somenames" input))))
;; => _unnamed [1 7]:
;;    | :x |  a | s |  e |  m |  n | o |
;;    |---:|---:|--:|---:|---:|---:|--:|
;;    |  1 | 22 | 2 | 10 | 33 | 11 | 3 |

behrica · 2022-10-22T21:37:14Z

I am now wondering whether this use case should be handled by tc/separate-column or whether it requires a completely new function, for performance reasons. The seq in your example, [2 3 9 10 11 22 33], could just as well be a double array, like this:

(def ds
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

Separating this (especially when it is large) could be optimized like this:

(->
 (tech.v3.datatype/concat-buffers (:y ds))
 (tech.v3.tensor/reshape [(tc/row-count ds)
                          (-> ds :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

(plus replacing the column :y in ds with the new dataset)

I suppose this is significantly faster than the generic "separate" implementation you have in tc/separate-column.
It works for the persistent vector case above as well.

Test cases could be these:

(def ds-1
  (-> (tc/dataset {:x [1 2] :y [[2 3 9 10 11 22 33]
                                [2 3 9 10 11 22 33]]})))

(def ds-2
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

(def ds-3
  (-> (tc/dataset {:x [1] :y [(list 2 3 9 10 11 22 33)]})))


(->
 (tech.v3.datatype/concat-buffers (:y ds-1))
 (tech.v3.tensor/reshape [(tc/row-count ds-1)
                          (-> ds-1 :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))
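
The reshape pipeline above, together with the "replace the column" step, could be wrapped roughly as follows. This is only a sketch: the name unpack-array-column is hypothetical (not an existing tablecloth function), and it assumes every row of the column holds an array of the same type and length, and that the generated column names do not clash with existing ones.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt]
         '[tech.v3.dataset.tensor :as ds-tensor])

;; Hypothetical helper, not part of tablecloth.
;; Assumes each row of `col` holds an array of identical type and length.
(defn unpack-array-column
  [ds col]
  (let [n-rows   (tc/row-count ds)
        n-cols   (count (first (get ds col)))
        ;; concatenate all row arrays into one buffer, then view it as a
        ;; [n-rows n-cols] tensor and turn that tensor into a dataset
        unpacked (-> (dtype/concat-buffers (get ds col))
                     (dtt/reshape [n-rows n-cols])
                     (ds-tensor/tensor->dataset))]
    ;; drop the source column and bind the new columns to what is left
    (-> ds
        (tc/drop-columns col)
        (tc/append unpacked))))

(def ds-arrays
  (tc/dataset {:x [1 2]
               :y [(double-array [2 3 9])
                   (double-array [10 11 22])]}))

(def unpacked-ds (unpack-array-column ds-arrays :y))
```

Here unpacked-ds should keep :x and gain one column per array element; the new column names are whatever tensor->dataset generates.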

behrica · 2022-10-22T21:39:48Z

In some cases we even want to get the tensor back rather than the data frame, so we would omit the last tensor->dataset call.

I think it is a useful addition to tablecloth; we often go from a dataset to a conceptual 2-d matrix
(but with the matrix rows stored inside a single dataset column).

behrica · 2022-10-22T22:10:03Z

Not sure about the reverse:
starting from a dataset with several (numeric) columns and squeezing them into a single column of native arrays.

behrica · 2022-10-22T22:11:02Z

For the reverse, something like this works; I am not sure whether it is optimal:


(def ds
  ;; => _unnamed [3 2]:
  ;;    | :x-0 | :x-1 |
  ;;    |-----:|-----:|
  ;;    |    1 |    4 |
  ;;    |    2 |    5 |
  ;;    |    3 |    6 |
  (->
   (tc/dataset {:x-0 [1 2 3]
                :x-1 [4 5 6]})))
                


(def rows
  (->
   (tech.v3.datatype/concat-buffers (tc/columns ds))
   (tech.v3.tensor/reshape [(tc/column-count ds)
                            (tc/row-count ds)])
   (tech.v3.tensor/transpose [1 0])
   (tech.v3.tensor/rows)))

(tc/dataset {:x (map tech.v3.datatype/->double-array rows)})
;; => _unnamed [3 1]:
;;    |          :x |
;;    |-------------|
;;    | [D@1600011f |
;;    |  [D@fc74513 |
;;    | [D@20c51970 |
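
The same steps can be folded into one function. This is a sketch only: pack-columns is a hypothetical name (not an existing tablecloth function), and it assumes all columns of the dataset are numeric and should be packed.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt])

;; Hypothetical helper, not part of tablecloth.
;; Packs every column of `ds` into one column of double arrays, one array per row.
(defn pack-columns
  [ds new-col]
  (let [rows (-> (dtype/concat-buffers (tc/columns ds))
                 ;; columns are concatenated one after another,
                 ;; so the buffer is [n-columns n-rows] ...
                 (dtt/reshape [(tc/column-count ds) (tc/row-count ds)])
                 ;; ... and transposing gives one tensor row per dataset row
                 (dtt/transpose [1 0])
                 (dtt/rows))]
    (tc/dataset {new-col (map dtype/->double-array rows)})))

(def packed
  (pack-columns (tc/dataset {:x-0 [1 2 3]
                             :x-1 [4 5 6]})
                :x))
```

Unlike a full "join" operation, this sketch returns a new single-column dataset and does not keep any other columns of the input.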

behrica · 2022-10-22T22:15:10Z

I would think that a pair of functions to go from one representation to the other would be useful.

genmeblog · 2022-10-24T07:45:36Z

This looks like a very specific case, a kind of matrix transpose. I'm not sure it belongs in TC.

The last case (the reverse) can be done with join-columns and {:result-type double-array}.

BTW, does tensor work on non-numerical data?
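
For reference, the join-columns variant applied to the small two-column dataset from the previous comment might look like this (a sketch; the target column name :x is chosen only for illustration):

```clojure
(require '[tablecloth.api :as tc])

;; Join :x-0 and :x-1 into a single column :x; each row's values are
;; passed as a sequence to double-array, so every cell holds a double array.
(def joined
  (-> (tc/dataset {:x-0 [1 2 3]
                   :x-1 [4 5 6]})
      (tc/join-columns :x [:x-0 :x-1] {:result-type double-array})))
```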

genmeblog · 2022-10-24T08:26:15Z

My original solution landed in 6.103

behrica · 2022-10-24T12:57:35Z

Numeric only.
I think there should be 2 methods for this in TC; they operate on a Dataset.
It's a specific form of separate.

behrica · 2022-10-24T13:00:27Z

Numeric only.
I think there should be 2 methods for this in TC; they operate on a Dataset.
It's a specific form of separate and requires arrays of the same type and length in each row.
I can do a PR, as I have a use case.

behrica · 2022-10-24T13:01:27Z

But indeed, this goes into numeric territory and going from a dataset to a matrix.

behrica · 2022-10-24T19:13:57Z

I will try it out forward and backward.
I have the impression, without proof, that my code above could be far more performant, at the cost of some constraints.

I will measure it on a larger case.

behrica · 2022-10-24T19:57:15Z

As I thought. On a 1000 x 1000 matrix-like dataset of doubles:

(def ds (api/dataset {:x (map 
                          (fn [_] (double-array (range 1000)))
                          (range 1000))}))

we get a difference in execution time of a factor of 50 to 100 (roughly 44x in the single run shown below):

(defn use-separate []
 (api/separate-column ds :x))

(defn use-reshape []
 (->
  (tech.v3.datatype/concat-buffers (:x ds))
  (tech.v3.tensor/reshape [(api/row-count ds)
                           (-> ds :x first count)])
  (tech.v3.dataset.tensor/tensor->dataset)))


(time (def _ (use-separate)))
;; "Elapsed time: 3371.491881 msecs"
(time (def _ (use-reshape)))
;; "Elapsed time: 76.420533 msecs"

Both produce the same dataset.

behrica · 2022-10-24T20:13:52Z

For the reverse there is less of a difference, but still a factor of 5:

(def ds-with-cols (use-reshape))

(time
 (def _  (api/join-columns ds-with-cols :x (api/column-names ds-with-cols) {:result-type double-array})))
;; "Elapsed time: 333.478279 msecs"

(time
 (let [rows
       (->
        (tech.v3.datatype/concat-buffers (api/columns ds-with-cols))
        (tech.v3.tensor/reshape [(api/column-count ds-with-cols)
                                 (api/row-count ds-with-cols)])
        (tech.v3.tensor/transpose [1 0])
        (tech.v3.tensor/rows))]
   (api/dataset {:x (map tech.v3.datatype/->double-array rows)})))
;; "Elapsed time: 66.384538 msecs"

behrica · 2022-10-24T20:20:36Z

But I was wrong above: the code works with non-numeric data as well.

genmeblog · 2022-10-24T21:19:40Z

Yes, join-columns and separate-column are slow. I know that. These two functions are more general than just packing/unpacking a sequence to/from column(s).
join-columns and separate-column are more or less the same as tidyr's extract, separate and unite functions.

Your example is just one special case, which can certainly be optimized. If you have an idea for a PR, it's always welcome.

behrica mentioned this issue Oct 24, 2022

Create functions for packing / unpacking columns to arrays #82

Closed

behrica closed this as completed Oct 24, 2022
