Missing Columns method #164

Open
OlivierBlanvillain opened this issue Aug 8, 2017 · 7 comments

@OlivierBlanvillain
Contributor

OlivierBlanvillain commented Aug 8, 2017

Exhaustive status of the API implemented by frameless.TypedColumn compared to Spark's Column. It is split in two: the methods implemented directly on Column, and the methods coming from org.apache.spark.sql.functions._

Column methods

Won't fix:

  • Column alias(String alias) inherently unsafe
  • Column apply(Object extraction) inherently unsafe
  • Column as(String alias) inherently unsafe
  • Column name(String alias) inherently unsafe
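
The issue does not spell out why these string-based methods count as unsafe. One plausible reading is that a column name passed as a runtime string is invisible to the compiler. The sketch below is a toy model in plain Scala, not Spark or frameless code; `StringLookupSketch`, `toRow`, `extract`, and `alias` are hypothetical names introduced only for illustration.

```scala
// Toy model only: rows are plain Maps, not Spark Rows.
object StringLookupSketch {
  final case class Person(name: String, age: Int)

  // Untyped view of a row: column names are plain runtime strings.
  def toRow(p: Person): Map[String, Any] =
    Map("name" -> p.name, "age" -> p.age)

  // String-based extraction: a typo is only noticed when this runs.
  def extract(row: Map[String, Any], column: String): Option[Any] =
    row.get(column)

  // Aliasing by string renames a column to a name that no type records,
  // so later lookups under the old name silently find nothing.
  def alias(row: Map[String, Any], from: String, to: String): Map[String, Any] =
    row.get(from) match {
      case Some(value) => (row - from) + (to -> value)
      case None        => row
    }
}
```

Here `extract(row, "agee")` compiles without complaint and only comes back empty at runtime, which is the kind of failure a typed API tries to rule out.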

TODO / done:

  • Column asc_nulls_first()
  • Column asc_nulls_last()
  • Column desc_nulls_first()
  • Column desc_nulls_last()
  • void explain(boolean extended)
  • Column eqNullSafe(Object other)
  • Column getField(String fieldName)
  • Column getItem(Object key)
  • Column isNotNull()
  • Column isNull()
  • Column like(String literal)
  • Column over()
  • Column over(WindowSpec window)
  • Column rlike(String literal)
  • Column isNaN()
  • Column substr(Column startPos, Column len) (WIP Add a typed 'substr' column method #263)
  • Column substr(int startPos, int len) (WIP Add a typed 'substr' column method #263)
  • Column mod(Object other) (WIP Added mod operator to TypedColumn.scala #296)
  • Column between(Object lowerBound, Object upperBound)
  • Column multiply(Object other)
  • Column endsWith(String literal)
  • Column isin(Object... list) (add isin to TypedColumn with restriction to primitive types #254)
  • Column startsWith(Column other)
  • Column startsWith(String literal)
  • Column otherwise(Object value)
  • Column when(Column condition, Object value)
  • Column and(Column other)
  • Column contains(Object other)
  • Column or(Column other)
  • Column bitwiseAND(Object other)
  • Column bitwiseOR(Object other)
  • Column bitwiseXOR(Object other)
  • TypedColumn<Object,U> as(Encoder evidence$1) (as cast)
  • Column asc() (as sortAscending)
  • Column cast(DataType to)
  • Column desc() (as sortDescending)
  • Column divide(Object other)
  • boolean equals(Object that) (as ===)
  • Column equalTo(Object other) (as ===)
  • org.apache.spark.sql.catalyst.expressions.Expression expr()
  • Column geq(Object other) (as >=)
  • Column gt(Object other) (as >)
  • Column leq(Object other) (as <=)
  • Column lt(Object other) (as <)
  • Column minus(Object other)
  • Column notEqual(Object other) (as =!=)
  • Column plus(Object other)
  • String toString()
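
Several entries above note that a Spark method surfaces as a symbolic operator (`===`, `=!=`, `>=`, `<`). As a rough illustration of what a typed variant buys, here is a minimal sketch in plain Scala. It is a toy model, not frameless' implementation: `Col`, `lit`, and `TypedOpsSketch` are hypothetical, and the real TypedColumn wraps a Spark Column rather than a function.

```scala
// Toy model only: a column is represented as a function from row to value.
object TypedOpsSketch {
  final case class Person(name: String, age: Int)

  // A column over rows of type T that yields values of type A.
  final case class Col[T, A](eval: T => A) {
    // Both sides must share the value type A, so comparing an Int column
    // with a String column is rejected at compile time.
    def ===(other: Col[T, A]): Col[T, Boolean] =
      Col[T, Boolean](t => eval(t) == other.eval(t))

    def =!=(other: Col[T, A]): Col[T, Boolean] =
      Col[T, Boolean](t => eval(t) != other.eval(t))

    // Ordering operators additionally require an Ordering for A.
    def >=(other: Col[T, A])(implicit ord: Ordering[A]): Col[T, Boolean] =
      Col[T, Boolean](t => ord.gteq(eval(t), other.eval(t)))

    def <(other: Col[T, A])(implicit ord: Ordering[A]): Col[T, Boolean] =
      Col[T, Boolean](t => ord.lt(eval(t), other.eval(t)))
  }

  // Lifts a constant into a column, loosely mirroring Spark's lit.
  def lit[T, A](value: A): Col[T, A] = Col[T, A](_ => value)

  val age: Col[Person, Int] = Col[Person, Int](_.age)
  val adults: Col[Person, Boolean] = age >= lit[Person, Int](18)
}
```

With this shape, an expression such as `age === lit[Person, String]("x")` does not type-check, whereas the untyped `equalTo(Object other)` accepts any argument.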

org.apache.spark.sql.functions

TODO / done:

  • Column col(String colName) to be implemented using shapeless.Witness
  • Column add_months(Column startDate, int numMonths)
  • Column array(String colName, String... colNames)
  • Column asc_nulls_first(String columnName)
  • Column asc_nulls_last(String columnName)
  • Column asc(String columnName)
  • Dataset broadcast(Dataset df)
  • Column ceil(String columnName)
  • Column coalesce(Column... e)
  • Column cume_dist()
  • Column current_date()
  • Column current_timestamp()
  • Column date_add(Column start, int days)
  • Column date_format(Column dateExpr, String format)
  • Column date_sub(Column start, int days)
  • Column datediff(Column end, Column start)
  • Column dayofmonth(Column e)
  • Column dayofyear(Column e)
  • Column decode(Column value, String charset)
  • Column dense_rank()
  • Column desc_nulls_first(String columnName)
  • Column desc_nulls_last(String columnName)
  • Column desc(String columnName)
  • Column encode(Column value, String charset)
  • Column expm1(String columnName)
  • Column expr(String expr)
  • Column factorial(Column e)
  • Column first(String columnName, boolean ignoreNulls)
  • Column floor(String columnName)
  • Column format_number(Column x, int d)
  • Column format_string(String format, Column... arguments)
  • Column from_json(Column e, StructType schema, scala.collection.immutable.Map<String,String> options)
  • Column from_unixtime(Column ut, String f)
  • Column from_utc_timestamp(Column ts, String tz)
  • Column get_json_object(Column e, String path)
  • Column greatest(String columnName, String... columnNames)
  • Column grouping_id(String colName, scala.collection.Seq colNames)
  • Column grouping(String columnName)
  • Column hash(Column... cols)
  • Column hash(scala.collection.Seq cols)
  • Column hex(Column column)
  • Column hour(Column e)
  • Column initcap(Column e)
  • Column input_file_name()
  • Column isnan(Column e)
  • Column isnull(Column e)
  • Column json_tuple(Column json, String... fields)
  • Column lag(String columnName, int offset, Object defaultValue)
  • Column last_day(Column e)
  • Column last(String columnName, boolean ignoreNulls)
  • Column lead(String columnName, int offset, Object defaultValue)
  • Column least(String columnName, String... columnNames)
  • Column lit(Object literal)
  • Column locate(String substr, Column str, int pos)
  • Column map(Column... cols)
  • Column map(scala.collection.Seq cols)
  • Column md5(Column e)
  • Column mean(String columnName)
  • Column minute(Column e)
  • Column monotonicallyIncreasingId()
  • Column month(Column e)
  • Column months_between(Column date1, Column date2)
  • Column nanvl(Column col1, Column col2)
  • Column next_day(Column date, String dayOfWeek)
  • Column ntile(int n)
  • Column percent_rank()
  • Column posexplode(Column e)
  • Column quarter(Column e)
  • Column radians(String columnName)
  • Column rand()
  • Column rand(long seed)
  • Column randn()
  • Column randn(long seed)
  • Column rank()
  • Column regexp_extract(Column e, String exp, int groupIdx)
  • Column repeat(Column str, int n)
  • Column rint(String columnName)
  • Column round(Column e, int scale)
  • Column row_number()
  • Column second(Column e)
  • Column signum(String columnName)
  • Column sort_array(Column e, boolean asc)
  • Column soundex(Column e)
  • Column spark_partition_id()
  • Column split(Column str, String pattern)
  • Column struct(Column... cols)
  • Column struct(scala.collection.Seq cols)
  • Column struct(String colName, scala.collection.Seq colNames)
  • Column struct(String colName, String... colNames)
  • Column substring_index(Column str, String delim, int count)
  • Column sumDistinct(Column e)
  • Column sumDistinct(String columnName)
  • Column to_date(Column e)
  • Column to_json(Column e, Map<String,String> options)
  • Column to_utc_timestamp(Column ts, String tz)
  • Column translate(Column src, String matchingString, String replaceString)
  • Column trunc(Column date, String format)
  • Column unbase64(Column e)
  • Column unhex(Column column)
  • Column unix_timestamp()
  • Column unix_timestamp(Column s)
  • Column unix_timestamp(Column s, String p)
  • Column var_pop(String columnName)
  • Column var_samp(String columnName)
  • Column weekofyear(Column e)
  • Column when(Column condition, Object value)
  • Column window(Column timeColumn, String windowDuration)
  • Column window(Column timeColumn, String windowDuration, String slideDuration)
  • Column window(Column timeColumn, String windowDuration, String slideDuration, String startTime)
  • Column year(Column e)
  • Column conv(Column num, int fromBase, int toBase)
  • Column degrees(String columnName)
  • Column negate(Column e)
  • Column not(Column e)
  • Column hypot(String leftName, String rightName)
  • Column log(double base, String columnName)
  • Column log(String columnName)
  • Column log10(Column e)
  • Column log1p(Column e)
  • Column log2(Column expr)
  • Column pmod(Column dividend, Column divisor)
  • Column pow(String leftName, String rightName)
  • Column bround(Column e, int scale)
  • Column cbrt(String columnName)
  • Column crc32(Column e)
  • Column exp(String columnName)
  • Column sha1(Column e)
  • Column sha2(Column e, int numBits)
  • Column shiftLeft(Column e, int numBits)
  • Column shiftRight(Column e, int numBits)
  • Column shiftRightUnsigned(Column e, int numBits)
  • Column sqrt(String colName)
  • Column cos(String columnName)
  • Column cosh(String columnName)
  • Column sin(String columnName)
  • Column sinh(String columnName)
  • Column tan(String columnName)
  • Column tanh(String columnName)
  • Column approxCountDistinct(String columnName, double rsd)
  • Column avg(String columnName)
  • Column callUDF(String udfName, Column... cols)
  • Column collect_list(String columnName) (as collectList)
  • Column collect_set(String columnName) (as collectSet)
  • Column corr(String columnName1, String columnName2)
  • Column count(Column e)
  • Column countDistinct(String columnName, String... columnNames)
  • Column explode(Column e)
  • Column first(String columnName)
  • Column last(String columnName)
  • Column max(String columnName)
  • Column min(String columnName)
  • Column size(Column e)
  • Column stddev(String columnName)
  • Column sum(Column e)
  • UserDefinedFunction udf(scala.Function0 f, scala.reflect.api.TypeTags.TypeTag evidence$1)
  • <RT,A1> UserDefinedFunction udf(scala.Function1<A1,RT> f, scala.reflect.api.TypeTags.TypeTag evidence$2, scala.reflect.api.TypeTags.TypeTag evidence$3)
  • UserDefinedFunction udf(Object f, DataType dataType)
  • Column variance(String columnName)
  • Column stddev_pop(String columnName)
  • Column stddev_samp(String columnName)
  • Column covar_pop(String columnName1, String columnName2)
  • Column covar_samp(String columnName1, String columnName2)
  • Column kurtosis(String columnName)
  • Column skewness(String columnName)
  • Column abs(Column e)
  • Column acos(String columnName)
  • Column array_contains(Column column, Object value)
  • Column ascii(Column e)
  • Column asin(String columnName)
  • Column atan(String columnName)
  • Column atan2(String leftName, String rightName)
  • Column base64(Column e)
  • Column bin(String columnName)
  • Column bitwiseNOT(Column e)
  • Column concat_ws(String sep, Column... exprs)
  • Column concat(Column... exprs)
  • Column instr(Column str, String substring)
  • Column length(Column e)
  • Column levenshtein(Column l, Column r)
  • Column lower(Column e)
  • Column lpad(Column str, int len, String pad)
  • Column ltrim(Column e)
  • Column regexp_replace(Column e, String pattern, String replacement)
  • Column reverse(Column str)
  • Column rpad(Column str, int len, String pad)
  • Column rtrim(Column e)
  • Column substring(Column str, int pos, int len)
  • Column trim(Column e)
  • Column upper(Column e)
@imarios
Contributor

imarios commented Aug 23, 2017

Hi @OlivierBlanvillain! Thanks for adding this! I think some are not relevant, like anything that has to do with "null". I actually replaced all of those with "isNone" and "isNotNone". I don't remember in which PR I did that, and I am not sure it is merged. I have to take another look.
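
To make the null versus Option point concrete, here is a minimal sketch of how `isNone` and `isNotNone` could be restricted to optional columns. It is a toy model in plain Scala under assumed names (`Col`, `OptionColOps`, `OptionColumnSketch`), not the code from the PR mentioned above.

```scala
// Toy model only: a column is represented as a function from row to value.
object OptionColumnSketch {
  final case class Person(name: String, nickname: Option[String])

  final case class Col[T, A](eval: T => A)

  // These methods exist only when the column's value type is an Option,
  // so calling isNone on a non-optional column does not compile.
  implicit class OptionColOps[T, A](col: Col[T, Option[A]]) {
    def isNone: Col[T, Boolean] =
      Col[T, Boolean](t => col.eval(t).isEmpty)

    def isNotNone: Col[T, Boolean] =
      Col[T, Boolean](t => col.eval(t).isDefined)
  }

  val nickname: Col[Person, Option[String]] =
    Col[Person, Option[String]](_.nickname)
}
```

Compared with `isNull()` and `isNotNull()`, which Spark accepts on any column, this shape records nullability in the column's type.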

@GrafBlutwurst
Contributor

I'm in the groove of implementing those functions anyway; I will most likely start mid-to-late September. Thanks for listing them all, @OlivierBlanvillain!

@OlivierBlanvillain
Contributor Author

@imarios It would be interesting to try implementing these without affecting performance; getting there would be amazing!

@GrafBlutwurst Awesome, hopefully most of them are really straightforward to implement, and with your bivariatePropTemplate/univariatePropTemplate helpers, testing them is trivial!

imarios added the feature label Sep 1, 2017
@iravid
Contributor

iravid commented Sep 21, 2017

Just saw this ticket. @OlivierBlanvillain, could you elaborate on why functions.col is unsafe, assuming we provide a version that uses shapeless Witness and verifies that the symbol exists in T?

@OlivierBlanvillain
Contributor Author

Edited, nice catch! Indeed, there is no reason not to build a Witness-powered version of that! (There are probably functions marked TODO that don't make sense; I didn't spend much time on each entry.)
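
For readers unfamiliar with the idea, a compile-time-checked `col` could look roughly like the sketch below. Note the substitutions: instead of shapeless Witness it uses Scala 2.13 literal singleton types, and the field evidence is written by hand rather than derived from the case class. `Exists`, `ColBuilder`, `col`, and `WitnessColSketch` are hypothetical names, so treat this as an illustration of the technique rather than the frameless API.

```scala
// Assumes Scala 2.13 or later (literal types such as "age").
object WitnessColSketch {
  final case class Person(name: String, age: Int)

  // Evidence that row type T has a field named K holding values of type A.
  trait Exists[T, K <: String, A] { def get(row: T): A }

  object Exists {
    // Hand-written evidence; a real library would derive these instances.
    implicit val personName: Exists[Person, "name", String] =
      new Exists[Person, "name", String] {
        def get(row: Person): String = row.name
      }

    implicit val personAge: Exists[Person, "age", Int] =
      new Exists[Person, "age", Int] {
        def get(row: Person): Int = row.age
      }
  }

  final class ColBuilder[T] {
    // K is inferred as the literal type of `key`; A is fixed by the evidence.
    def apply[K <: String with Singleton, A](key: K)(
        implicit ev: Exists[T, K, A]): T => A =
      (row: T) => ev.get(row)
  }

  // Entry point: col[Person]("age") yields a Person => Int accessor.
  def col[T]: ColBuilder[T] = new ColBuilder[T]
}
```

With these definitions, `col[Person]("age")` resolves to an accessor of type `Person => Int`, while a misspelled name such as `col[Person]("agee")` fails at compile time because no matching `Exists` instance is found.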

@rbraley
Contributor

rbraley commented Sep 28, 2017

I can try to handle some of these as a first contribution to the repo :)

@pgabara
Contributor

pgabara commented Mar 6, 2018

Hi guys, I just added my first PR with a typed substr column method. Let me know what you think about it.
