separate-columns with default target naming #78
This will be a (minor) breaking change. By default, the source column will be replaced by the new ones in every case.

```clojure
(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y))
;; => _unnamed [1 8]:
;; | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;; |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;; |  1 |    2 |    3 |    9 |   10 |   11 |   22 |   33 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y reverse))
;; => _unnamed [1 8]:
;; | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;; |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;; |  1 |   33 |   22 |   11 |   10 |    9 |    3 |    2 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y (fn [input]
                             (zipmap "somenames" input))))
;; => _unnamed [1 7]:
;; | :x |  a | s |  e |  m |  n | o |
;; |---:|---:|--:|---:|---:|---:|--:|
;; |  1 | 22 | 2 | 10 | 33 | 11 | 3 |
```
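A side note on the last example: the result has only six new columns for seven input values because `zipmap` is fed the characters of `"somenames"` as keys and `\m` occurs twice among the first seven. The plain-Clojure sketch below reproduces just that step outside of tablecloth. The `col-N` naming in the second half is a hypothetical alternative used for illustration only; it is not something the library does.

```clojure
;; The values from the :y cell in the example above.
(def input [2 3 9 10 11 22 33])

;; zipmap pairs keys with values until the shorter seq runs out,
;; so only the first seven characters \s \o \m \e \n \a \m are used.
;; \m appears twice and the later value (33) overwrites the earlier one (9).
(def result (zipmap "somenames" input))

;; Hypothetical alternative: generate unique keyword names so no value is lost.
(def unique-result
  (zipmap (map #(keyword (str "col-" %)) (range)) input))

(count result)        ;; six distinct keys
(count unique-result) ;; seven distinct keys
```

With duplicate keys the value `9` is silently dropped, so a custom separator function that returns a map needs unique names if every element should survive.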
I am now wondering whether this use case should be handled by `tc/separate-column` or whether it requires a completely new function, for performance reasons. The seq in your example
And separating this (especially when it is large) could be optimized in this way:
(+ replacing the column :y in the ds with the new ds) I suppose this is significantly faster than a generic "separate" implementation you have in test cases could be those:
In some cases we even want to get the tensor back rather than the data frame, so omit the last
I think it is a useful addition to tablecloth; we often go from a dataset to a conceptual 2-d matrix.
Not sure about the reverse.
For the reverse, something like this works, though I am not sure it is optimal:
I would think that a pair of functions to go from one representation to the other would be useful.
Looks like a very specific case, a kind of matrix transpose. I'm not sure if it belongs in TC. The last case (reverse) can be done with
BTW, does tensor work on non-numerical data?
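To make the "pair of functions" and "kind of transpose" idea concrete, here is a minimal plain-Clojure sketch of both directions. It is not the tablecloth implementation and not the optimized tensor-based approach discussed in this thread; `seqs->columns` and `columns->seqs` are hypothetical names, and the sketch assumes a non-empty column whose sequences all have the same length.

```clojure
(defn seqs->columns
  "Transposes a column of equal-length sequences into a map of
   :prefix-0, :prefix-1, ... keys to column vectors."
  [prefix rows]
  (let [columns (apply mapv vector rows)] ; row-major -> column-major
    (into {}
          (map-indexed (fn [i col] [(keyword (str prefix "-" i)) col]))
          columns)))

(defn columns->seqs
  "The reverse: rebuilds one vector per row from the :prefix-N columns."
  [prefix column-map]
  (let [n       (count column-map)
        columns (map #(get column-map (keyword (str prefix "-" %))) (range n))]
    (apply mapv vector columns)))

(def separated (seqs->columns "y" [[2 3 9] [4 5 6]]))
(def joined    (columns->seqs "y" separated))
```

This generic version walks every element through persistent vectors, which is the kind of overhead a specialized matrix-style path would presumably avoid on large data.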
My original solution landed in
Numeric only.
But indeed, this goes into numeric territory: going from a dataset to a matrix.
I will try it out forward and backward, and measure it on a larger case.
As I thought. On a 1000 x 1000 matrix-type dataset of doubles:
we get a factor of 50 to 100 difference in execution time
for producing the same dataset.
The reverse shows less of a difference, but still a factor of 5:
But I was wrong above: the code works with non-numeric data as well.
Yes, your example is just one special case, which can certainly be optimized. If you have an idea for a PR, it's always welcome.
https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/seperate.20with.20custom.20fn