API discussion

This issue is for general discussion of the `datatable` API. It should be closed only once the discussion has stabilized and the majority of the suggested syntax has been either implemented or delegated to separate issues.

First, as a general principle, `datatable` is a sibling of R's `data.table`, and aims to mimic its API / algorithms whenever possible and reasonable. At the same time, many of the design choices that went into `data.table` stem from the functionality of base R; such functionality may be awkward when transferred into Python directly. So some kind of balanced approach is needed. Finally, it must be acknowledged that R gives much more freedom in syntactic expression to the user, which means many of the constructs used in `data.table` are simply not possible in Python.

Main syntax
-----
The cornerstone of [data.table's API](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) is the following syntactic form:
```
                                 DT[i, j, by, ...]
```
where `...` denotes extra options. Here `i` and `j` are positional arguments, denoting the row and column selectors respectively (alternatively, `j` is often called the "what to do" argument, as it can specify arbitrary calculations over the columns). The `by` argument may also be positional, but more commonly it is used in named form (i.e. `by=...`), especially considering that it is frequently replaced with `keyby=...`, which is another mode of grouping.

This syntax is good, and we generally want to retain it. However, there is a big caveat: Python does not support named parameters in square-bracket selectors. There is [PEP-472](https://www.python.org/dev/peps/pep-0472/) to add such support. The PEP dates back to 2014 and was on the "standards track" for Py3.6; however, today Py3.7 is ~almost~ already out, and the proposal has not been implemented yet. So don't get your hopes too high...
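To make the caveat concrete, here is a minimal sketch of what a Python subscript actually receives. The `Probe` class is purely illustrative and is not part of `datatable`.

```python
class Probe:
    """Toy object that simply returns whatever key the subscript receives."""

    def __getitem__(self, key):
        return key


DT = Probe()

# Several positional selectors arrive in __getitem__ as a single tuple.
positional_key = DT[1, "A"]

# A keyword inside square brackets is rejected by the parser itself,
# so the expression is compiled from a string to keep this file importable.
try:
    compile('DT[1, "A", by="B"]', "<example>", "eval")
    keyword_allowed = True
except SyntaxError:
    keyword_allowed = False
```

Everything inside the brackets therefore has to be expressed positionally, which is what motivates the syntax proposed next.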

Having given all this a considerable amount of thought, I came up with the following suggested primary syntax for `datatable`:
```
                                 DT[i, j]
                                 DT[i, j, by(...)]
                                 DT[i, j, join(...)]
                                 etc.
```
Thus, the simplest form uses `DT[i, j]`, which is perfectly natural for indexing a 2-dimensional table of data. However, the grouping argument, if present, *must* be "named" via the function `by()`. The function `by()` may accept multiple columns or column expressions, and may also have its own parameters. For example, such parameters could be `method = "fast"|"sorted"|"keep_order"|"rle"` to choose the algorithm for grouping, `add_cols = True|False` to control whether key columns are automatically added to the resulting frame, `skip_na = False|True` to control whether an NA-valued group is dropped, `filter=<expr>` to remove certain groups based on custom logic, and so on.

Likewise, the generic syntax to perform a join is the `join()` verb: `DT[i, j, join(X, on=..., nomatch=..., mode=...)]`. We can support data.table's `DT[X]` syntax too, but I suspect it won't be very useful without the support of extra arguments such as `on=`, `mult=`, etc. Another point of distinction is that, unlike `DT[X]`, the expression `DT[:, :, join(X)]` will perform a left outer join with default params.
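As a rough illustration of how the "named via a function" idea can work, the sketch below has `__getitem__` recognise marker objects by type. The classes `by`, `join` and `Frame` here are toy stand-ins written for this example, with only a couple of the options mentioned above; they are assumptions, not the actual `datatable` implementation.

```python
class by:
    """Hypothetical grouping marker: records key columns and options."""

    def __init__(self, *cols, method="fast", add_cols=True):
        self.cols = cols
        self.method = method
        self.add_cols = add_cols


class join:
    """Hypothetical join marker: records the other frame and the join key."""

    def __init__(self, other, on=None):
        self.other = other
        self.on = on


class Frame:
    """Toy frame that only parses its subscript; it holds no data."""

    def __getitem__(self, key):
        if not isinstance(key, tuple) or len(key) < 2:
            raise TypeError("expected DT[i, j] or DT[i, j, by(...), ...]")
        i, j, *extras = key
        parsed = {"i": i, "j": j, "by": None, "join": None}
        # Extra arguments are identified by their type, not by position.
        for extra in extras:
            if isinstance(extra, by):
                parsed["by"] = extra
            elif isinstance(extra, join):
                parsed["join"] = extra
            else:
                raise TypeError(f"unexpected argument: {extra!r}")
        return parsed


DT = Frame()
query = DT[:, "B", by("A", method="sorted")]
```

Because the markers are identified by type, `by(...)` and `join(...)` could appear in either order after `i` and `j`.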

This takes care of most of the arguments to `[.data.table`. The arguments that do not fall into either the `by()` or the `join()` family are: `nomatch`, `which`, `with` and `verbose`. Out of these, `with` is not needed since in Python the mode `with=TRUE` does not work anyway, so we have to use `f.*` expressions. The `verbose` and `nomatch` parameters can be handled as global options. The `which` parameter is very awkward: a much cleaner approach is to have a special `.WHICH` symbol to be used in `j`.

f.* symbols
-----
As mentioned above, data.table's syntax `DT[, A]` to refer to column "A" cannot work in Python: `A` will be interpreted as a variable from the outer scope, not as column "A" in DT. Of course, `DT[:, "A"]` is ok in Python, but then you cannot write expressions such as `DT[:, "A" / "B"]`. Presumably, you could put the entire expression into a string, `DT[:, "A / B"]`, but even this has its limitations.

Instead, we opted for the `f.*` syntax: the `f` refers to the "frame currently being operated upon", and then `f.A` or `f["A"]` is the column "A" in that frame. The constant repetition of `f.` is somewhat tedious, but it has its own advantages too:
- it is easy to refer to a column whose name is in a variable: `f[var]`;
- similarly, you can refer to a column whose name is not a valid identifier: `f["Purchase price"]`;
- it is possible to distinguish between the columns of the current frame and the columns of the joined frame; the latter will use the prefix `g`;
- `data.table` occasionally uses a similar approach by saying `x.col` or `i.col`;
- the columnar expression(s) can be saved in a variable and then reused later.
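The points above can be sketched with a small deferred-expression object. This is a toy model, assuming a frame stored as a plain dict of column lists; `Expr` and `FrameProxy` are names made up for this example and only division is implemented.

```python
class Expr:
    """Deferred computation over a frame stored as a dict of column lists."""

    def __init__(self, evaluate, label):
        self._evaluate = evaluate
        self.label = label

    def __truediv__(self, other):
        def divide(frame):
            left = self._evaluate(frame)
            right = other._evaluate(frame)
            return [a / b for a, b in zip(left, right)]

        return Expr(divide, f"({self.label} / {other.label})")

    def evaluate(self, frame):
        return self._evaluate(frame)


class FrameProxy:
    """Stand-in for `f`: attribute and item access both name a column."""

    def __getattr__(self, name):
        return self[name]

    def __getitem__(self, name):
        return Expr(lambda frame: list(frame[name]), f"f[{name!r}]")


f = FrameProxy()

# The expression is built once, stored in a variable, and evaluated later.
ratio = f.A / f["Purchase price"]

data = {"A": [10.0, 9.0], "Purchase price": [2.0, 3.0]}
result = ratio.evaluate(data)

# Plain strings cannot play this role: str does not define division.
try:
    "A" / "B"
    strings_divide = True
except TypeError:
    strings_divide = False
```

Nothing is computed when `ratio` is created; the same object could be evaluated against any frame that has the two columns.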

In-place frame updates
-----
In data.table the syntax for this is `DT[i, col1:=expr]`. This is nice, but there is no ":=" operator in Python (at least until [PEP-572](https://www.python.org/dev/peps/pep-0572/), and even that would not be overloadable). Instead, we currently implement the following syntax for updates: `DT[i, col] = expr`. This works fine in small use cases but quickly becomes unreadable in larger ones. Consider: `DT[:, [colA, colB2, colC]] = [expr1, expr2, expr3]` -- which column name gets assigned which expression? Or `DT[:, col, join(X, ...), by(z)] = expr` -- the column name and the expression are so far from each other that it becomes unclear what kind of assignment takes place.

One way to deal with this problem is to introduce a special syntax for updates:
```
DT.update[i, {colA: expr1, colB: expr2}, ...]
```
or alternatively
```
DT[i, update(colA=expr1, colB=expr2), ...]
```
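To show how the second form keeps each column name next to its value, here is a minimal sketch. The `update` marker and `Frame` class below are toys written for this example; values are plain lists rather than column expressions, and `i` is limited to a slice or a single row index.

```python
class update:
    """Hypothetical marker for use in j: maps column names to new values."""

    def __init__(self, **cols):
        self.cols = cols


class Frame:
    """Toy frame backed by a dict of equal-length column lists."""

    def __init__(self, **columns):
        self.columns = {name: list(vals) for name, vals in columns.items()}

    def __getitem__(self, key):
        i, j = key
        if not isinstance(j, update):
            raise TypeError("this sketch only handles update(...) in j")
        nrows = len(next(iter(self.columns.values())))
        # Resolve the row selector into concrete row indices.
        rows = range(nrows)[i] if isinstance(i, slice) else [i]
        for name, values in j.cols.items():
            # New columns start out filled with None (standing in for NA).
            column = self.columns.setdefault(name, [None] * nrows)
            for row, value in zip(rows, values):
                column[row] = value
        return None  # the frame is modified in place


DT = Frame(A=[1, 2, 3])
DT[:, update(B=[10, 20, 30], A=[0, 0, 0])]
DT[1, update(B=[99])]
```

Note that the update happens inside `__getitem__`, so no assignment statement is involved and extra arguments such as `by(...)` could follow the `update(...)` call.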

Arbitrary group expressions
-----
One of the most powerful features of data.table is the ability to perform arbitrary calculations with subsets of the target frame corresponding to each group. This is done via the `.SD` special symbol: the `j` part of the `DT[i, j, by]` form can be an arbitrary function of `.SD`, as long as it creates a list (or a data.table) as a result.

Similar functionality can be achieved in Python `datatable` via a special function `apply()` (or `do()`), which can be used in place of the `j` expression. The function passed to it may take either one or two arguments, and may produce a list, a list-of-lists, a Frame, or None:
- the function will be called once for each group in the source frame, or once for each row if no `by()` clause was given;
- a single-argument function will be given the subframe of the data corresponding to each group (i.e. `.SD`);
- a two-argument function will be given the key value as the first arg, and the subset of data as the second arg;
- if the function returns `None`, that value is ignored; if it returns a list/tuple, that value is converted into a 1-row frame; if it returns a list-of-lists, it is converted into a frame (each inner list becomes a column);
- at the end, all produced frames are rbind-ed together (or combined according to the `combine` option).

The `apply()` function may have options to control its behavior: `sdcols` - same as `.SDcols` in R, `combine="rbind"|"cbind"|"list"`, etc.
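The rules above can be sketched in plain Python. This is only an approximation under several assumptions: the frame is a dict of column lists, grouping is by a single column, the one-argument versus two-argument distinction is made with an explicit `with_key` flag rather than by inspecting the function, results are always rbind-ed, and Frame return values are not handled. The helper name `apply_by_group` is made up for this example.

```python
def apply_by_group(columns, key, fn, with_key=False):
    """Split `columns` by the `key` column, call `fn` per group, rbind rows."""
    names = list(columns)
    groups = {}  # key value -> list of row indices, in first-seen order
    for row, value in enumerate(columns[key]):
        groups.setdefault(value, []).append(row)

    out_rows = []
    for value, rows in groups.items():
        # The per-group subframe plays the role of .SD.
        subframe = {n: [columns[n][r] for r in rows] for n in names}
        result = fn(value, subframe) if with_key else fn(subframe)
        if result is None:
            continue  # None results are ignored
        if result and isinstance(result[0], (list, tuple)):
            # list-of-lists: each inner list is a column, so transpose to rows
            out_rows.extend(zip(*result))
        else:
            out_rows.append(tuple(result))  # flat list/tuple: a single row
    return out_rows


data = {"g": ["a", "b", "a", "c"], "x": [1, 2, 3, 4]}


def summarize(key, sd):
    """Two-argument form: one output row per group, skipping group 'c'."""
    if key == "c":
        return None
    return [key, sum(sd["x"])]


def doubled(sd):
    """One-argument form: returns two columns of the same length as the group."""
    return [sd["g"], [v * 2 for v in sd["x"]]]


totals = apply_by_group(data, "g", summarize, with_key=True)
expanded = apply_by_group(data, "g", doubled)
```

Here `totals` has one row per surviving group, while `expanded` keeps one row per input row, ordered by group.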


----
Please share your thoughts / comments / suggestions.
