API discussion #1187

@st-pasha

This issue is for general discussion of datatable's API. It should only be closed when the discussion has stabilized and the majority of the suggested syntax has been either implemented or delegated to separate issues.

First, as a general principle, datatable is a sibling of R's data.table, and aims to mimic its API / algorithms whenever possible and reasonable. At the same time, many of the design choices that went into data.table stem from the functionality of base R; such functionality may be awkward when transferred into Python directly. So some kind of balanced approach is needed. Finally, it must be acknowledged that R gives much more freedom in syntactic expression to the user, which means many of the constructs used in data.table are simply not possible in Python.

Main syntax

The cornerstone of data.table's API is the following syntactic form:

                                 DT[i, j, by, ...]

where ... denotes extra options. Here i and j are positional arguments, denoting the row and column selectors respectively (alternatively, j is often called the "what to do" argument, as it can specify arbitrary calculations over the columns). The by argument may also be positional, but more commonly it is used in named form (i.e. by=...), especially since it is frequently replaced with keyby=..., which is another mode of grouping.

This syntax is good, and we generally want to retain it. However, there is a big caveat: Python does not support named parameters in square-bracket selectors. There is PEP-472 to add such support. The PEP dates back to 2014 and was on the "standards track" for Py3.6; however, today Py3.7 is almost out, and the proposal has not been implemented yet. So don't get your hopes too high...
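To make the caveat concrete, here is a small sketch of what a square-bracket selector actually delivers to an object. The `Probe` class and the helper function are hypothetical names used only for illustration; they are not part of datatable.

```python
class Probe:
    """Hypothetical stand-in for a frame: returns whatever the subscript receives."""

    def __getitem__(self, key):
        return key


DT = Probe()

# Positional selectors arrive in __getitem__ as one tuple.
print(DT[1, "A"])        # (1, 'A')
print(DT[:, "A", "B"])   # (slice(None, None, None), 'A', 'B')


def subscript_accepts_keywords() -> bool:
    """Check whether this interpreter can even parse DT[i, j, by=...]."""
    try:
        compile('DT[1, "A", by="B"]', "<demo>", "eval")
    except SyntaxError:
        return False
    return True


print(subscript_accepts_keywords())
```

Since everything inside the brackets is packed into a single positional tuple, any "named" argument has to be encoded as an object placed in that tuple.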

Having given all this a considerable amount of thought, I came up with the following suggested primary syntax for datatable:

                                 DT[i, j]
                                 DT[i, j, by(...)]
                                 DT[i, j, join(...)]
                                 etc.

Thus, the simplest form uses DT[i, j], which is perfectly natural for indexing a 2-dimensional table of data. However, the grouping argument, if present, must be "named" via function by(). The function by() may accept multiple columns or column expressions, and also have its own parameters. For example, such parameters could be method = "fast"|"sorted"|"keep_order"|"rle" to choose the algorithm for grouping, add_cols = True|False whether to automatically add key columns to the resulting frame, skip_na = False|True whether an NA-valued group is dropped, filter=<expr> to remove certain groups based on a custom logic, and so on.
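One way such a "named via function" argument could be recognized is sketched below. This is only an illustration under assumptions: the toy `Frame` class parses its subscript and does no real work, and the `by` class here carries just two of the options mentioned above.

```python
class by:
    """Hypothetical grouping marker: holds key columns plus its own options."""

    def __init__(self, *cols, method="fast", add_cols=True):
        self.cols = cols
        self.method = method
        self.add_cols = add_cols


class Frame:
    """Toy frame that only parses DT[i, j, ...]; it performs no computation."""

    def __getitem__(self, key):
        if not isinstance(key, tuple) or len(key) < 2:
            raise TypeError("expected DT[i, j] or DT[i, j, by(...)]")
        i, j, *extras = key
        parsed = {"i": i, "j": j, "by": None}
        for extra in extras:
            # Extra arguments are identified by their type, not by position.
            if isinstance(extra, by):
                parsed["by"] = extra
            else:
                raise TypeError(f"unsupported extra argument: {extra!r}")
        return parsed


DT = Frame()
plan = DT[:, "A", by("B", "C", method="sorted")]
print(plan["by"].cols, plan["by"].method)   # ('B', 'C') sorted
```

Because the extras are dispatched on type, a join(...) marker could be handled by the same loop without fixing the order of the trailing arguments.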

Likewise, the generic syntax to perform a join is the join() verb: DT[i, j, join(X, on=..., nomatch=..., mode=...)]. We can support the data.table's syntax DT[X] too, but I suspect it won't be very useful without the support of extra arguments such as on=, mult=, etc. Another point of distinction is that unlike DT[X], the expression DT[:, :, join(X)] will perform a left-outer-join with default params.

This takes care of most of the arguments to [.data.table. The arguments that do not fall into either by() or join() family are: nomatch, which, with and verbose. Out of these, with is not needed since in Python the mode with=TRUE does not work anyway, so we have to use f.* expressions. The verbose and nomatch parameters can be handled as global options. The which parameter is very awkward: a much cleaner approach is to have a special .WHICH symbol to be used in j.

f.* symbols

As mentioned above, the data.table's syntax DT[, A] to refer to column "A" cannot work in Python: A will be interpreted as a variable from the outer scope, not as column "A" in DT. Of course, DT[:, "A"] is ok in Python, but then you cannot do expressions such as DT[:, "A" / "B"]. Presumably, you could put the entire expression into a string DT[:, "A / B"], but even this has its limitations.

Instead, we opted for the f.* syntax: the f refers to the "frame currently being operated upon", and then f.A or f["A"] is the column "A" in that frame. The constant repetition of f. is somewhat tedious, but it has its own advantages too:

  • it is easy to refer to a column whose name is in a variable: f[var];
  • similarly, you can refer to a column whose name is not a valid identifier: f["Purchase price"];
  • it is possible to distinguish between the columns of the current frame and the columns of the joined frame (the latter will use the prefix g);
  • data.table occasionally uses a similar approach by saying x.col or i.col;
  • the columnar expression(s) can be saved in a variable and then reused later.
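The idea of a deferred column reference can be sketched in a few lines of plain Python. This is a toy model, not datatable's implementation: a "frame" here is just a dict of lists, and `Expr` and `FrameProxy` are hypothetical names.

```python
import operator


class Expr:
    """Deferred column expression; evaluated later against a concrete frame."""

    def __init__(self, fn):
        self._fn = fn  # callable: frame (dict of lists) -> list of values

    def evaluate(self, frame):
        return self._fn(frame)

    def _binary(self, other, op):
        def fn(frame):
            left = self.evaluate(frame)
            if isinstance(other, Expr):
                right = other.evaluate(frame)
            else:
                right = [other] * len(left)  # broadcast a scalar
            return [op(a, b) for a, b in zip(left, right)]
        return Expr(fn)

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __truediv__(self, other):
        return self._binary(other, operator.truediv)


class FrameProxy:
    """Stands for 'the frame currently being operated upon'."""

    def __getattr__(self, name):
        if name.startswith("__"):
            raise AttributeError(name)
        return self[name]

    def __getitem__(self, name):
        return Expr(lambda frame: list(frame[name]))


f = FrameProxy()

frame = {"A": [10, 20, 30], "Purchase price": [2, 4, 5]}
ratio = f.A / f["Purchase price"]   # saved in a variable, reusable later
print(ratio.evaluate(frame))        # [5.0, 5.0, 6.0]
```

Nothing is computed when `ratio` is built; the expression is bound to a frame only at evaluation time, which is what lets it be stored and reused.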

In-place frame updates

In data.table the syntax for this is DT[i, col1:=expr]. This is nice, but there is no ":=" operator in Python (at least until PEP-572, but even that would not be overloadable). Instead, we currently implement the following syntax for updates: DT[i, col] = expr. This works fine in small use cases but quickly becomes unreadable in larger ones. Consider: DT[:, [colA, colB2, colC]] = [expr1, expr2, expr3] -- which column name gets assigned which expression? Or DT[:, col, join(X, ...), by(z)] = expr -- the column name and the expression are so far from each other that it becomes unclear what kind of assignment takes place.

One way to deal with this problem is to introduce a special syntax for updates:

DT.update[i, {colA: expr1, colB: expr2}, ...]

or alternatively

DT[i, update(colA=expr1, colB=expr2), ...]
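A rough sketch of how the second form might behave is given below. Everything in it is an assumption made for illustration: the toy `Frame` stores columns as lists, and each "expression" is a plain callable taking (columns, row_index), which is not how real column expressions would be written.

```python
class update:
    """Hypothetical marker pairing each target column with its expression."""

    def __init__(self, **assignments):
        self.assignments = assignments


class Frame:
    """Toy frame: a dict of equal-length lists, updated in place."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}
        self.nrows = len(next(iter(self.columns.values()), []))

    def __getitem__(self, key):
        i, j = key
        if not isinstance(j, update):
            raise TypeError("this toy only supports update(...) in the j slot")
        rows = list(range(self.nrows)[i]) if isinstance(i, slice) else list(i)
        # Evaluate every expression first, so assignments do not see each other.
        new_values = {
            name: [expr(self.columns, r) for r in rows]
            for name, expr in j.assignments.items()
        }
        for name, values in new_values.items():
            column = self.columns.setdefault(name, [None] * self.nrows)
            for r, value in zip(rows, values):
                column[r] = value
        return self


DT = Frame({"A": [1, 2, 3], "B": [10, 20, 30]})
DT[:, update(C=lambda cols, r: cols["A"][r] + cols["B"][r],
             A=lambda cols, r: 0)]
DT[[0], update(B=lambda cols, r: -1)]
print(DT.columns)
```

Each column name sits right next to its expression, which is the readability gain over DT[:, [colA, colB2, colC]] = [expr1, expr2, expr3].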

Arbitrary group expressions

One of the most powerful features of data.table is the ability to perform arbitrary calculations with subsets of the target frame corresponding to each group. This is done via the .SD special symbol: the j part of the DT[i, j, by] form can be an arbitrary function of .SD -- as long as it creates a list (or a data.table) as a result.

A similar functionality can be achieved in Python datatable via a special function apply() (or do()) which can be used in place of j expression. This function may take either one or two arguments, and produce either a list, or list-of-lists, or a Frame, or None.

  • the function will be called once for each group in the source frame, or once for each row if no by() clause was given;
  • a single-argument function will be given the subframe of the data corresponding to each group (i.e. .SD);
  • a two-argument function will be given the key value as the first arg, and the subset of data as the second arg;
  • if the function returns None, that value is ignored; if it returns a list/tuple, that value is converted into a 1-row frame; if it returns a list-of-lists, it is converted into a frame (each list element becomes a column);
  • at the end, all produced frames are rbind-ed together (or combined according to the combine option).

The apply() function may have options to control its behavior: sdcols - same as .SDcols in R, combine="rbind"|"cbind"|"list", etc.
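The grouped case of these rules can be sketched in plain Python. This is a minimal model under assumptions: the frame is a dict of lists, grouping is by a single column, only the default rbind combination is shown, and `group_apply` is a hypothetical helper name rather than a proposed API.

```python
import inspect


def group_apply(frame, key, fn):
    """Call fn once per group of frame[key]; rbind the results into columns."""
    groups = {}
    for r, value in enumerate(frame[key]):
        groups.setdefault(value, []).append(r)

    # A two-argument function also receives the key value.
    wants_key = len(inspect.signature(fn).parameters) == 2

    pieces = []
    for value, rows in groups.items():
        sub = {name: [col[r] for r in rows] for name, col in frame.items()}
        result = fn(value, sub) if wants_key else fn(sub)
        if result is None:
            continue  # None results are ignored
        if result and isinstance(result[0], (list, tuple)):
            columns = [list(c) for c in result]   # list-of-lists: one list per column
        else:
            columns = [[v] for v in result]       # flat list: a 1-row frame
        pieces.append(columns)

    if not pieces:
        return []
    out = [[] for _ in pieces[0]]
    for columns in pieces:                        # rbind: append rows column-wise
        for dst, src in zip(out, columns):
            dst.extend(src)
    return out


frame = {"g": ["a", "b", "a", "c"], "x": [1, 2, 3, 4]}


def summarize(key, sd):
    if key == "c":
        return None
    return [key, sum(sd["x"])]


print(group_apply(frame, "g", summarize))                       # [['a', 'b'], [4, 2]]
print(group_apply(frame, "g", lambda sd: [sd["g"], sd["x"]]))   # [['a', 'a', 'b', 'c'], [1, 3, 2, 4]]
```

Groups are visited in order of first appearance here; a real implementation would presumably follow whatever ordering the by() method option selects.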


Please share your thoughts / comments / suggestions.

Labels: design-doc (Generic discussion / roadmap for how some major new functionality can be implemented)