initializing a workbook #8

@jennybc

@richfitz

From my onion-peeling adventure, I gather that if I read a worksheet via rexcel_read(), I drop into rexcel_read_workbook() and then into rexcel_read_worksheet(). At least, those are the exported functions called. It feels like there's one more layer or one more function than necessary? Q1: Can you help me understand the role of rexcel_read()? I think it's the one whose purpose isn't clear.

In googlesheets, for better or worse, there's an explicit registration step that creates an R object with metadata about a Google Sheet. Only with that in hand can you start reading stuff back out of it. With Google Sheets, this is practically a requirement rather than a voluntary design decision. But would a similar workflow make sense for rexcel?

I think I'm proposing that most of what's in rexcel_read_workbook() get moved into a workbook "registration" function, so that it's possible to get set up to read a workbook w/o actually diving down into any worksheets (currently not possible, I believe?).

I also think (correct me) that the current reading functions leave behind little to no info for worksheets that weren't specifically requested. Again, for a Google Sheet, when I register it, I create an overview of all worksheets (name and extent, mostly). When I think about us characterizing the Enron corpus, it would be nice to be able to register each workbook (15K) and get high-level info on the worksheets (80K) w/o necessarily reading their cells.
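To make the "overview without reading cells" idea concrete, here is a rough sketch. It is written in Python purely as a neutral illustration and is not rexcel's or linen's API: `register_workbook()`, `read_extent()` and `make_demo_workbook()` are hypothetical names. It assumes a standard xlsx layout (`xl/workbook.xml` plus `xl/_rels/workbook.xml.rels`) and takes the extent from each sheet's optional `<dimension ref="...">` element, so a sheet written without that element reports no extent.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by SpreadsheetML parts inside an .xlsx package.
MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_extent(zf, member):
    """Return the sheet's <dimension ref>, or None; stops before any cells."""
    with zf.open(member) as fh:
        for _event, elem in ET.iterparse(fh, events=("start",)):
            if elem.tag == f"{{{MAIN}}}dimension":
                return elem.get("ref")
            if elem.tag == f"{{{MAIN}}}sheetData":
                break  # cell data starts here; no dimension was declared
    return None


def register_workbook(source):
    """Hypothetical 'registration': list every sheet's name and extent."""
    with zipfile.ZipFile(source) as zf:
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
        # Map relationship ids (rId1, ...) to the sheet part they point at.
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.iter(f"{{{PKG_REL}}}Relationship")
        }
        sheets = []
        for sheet in workbook.iter(f"{{{MAIN}}}sheet"):
            target = targets[sheet.get(f"{{{DOC_REL}}}id")]
            if target.startswith("/"):
                member = target.lstrip("/")  # absolute within the package
            else:
                member = posixpath.join("xl", target)  # relative to xl/
            sheets.append(
                {"name": sheet.get("name"), "extent": read_extent(zf, member)}
            )
    return sheets


def make_demo_workbook():
    """Build a tiny two-sheet .xlsx in memory (illustrative fixture only)."""
    parts = {
        "xl/workbook.xml": (
            f'<workbook xmlns="{MAIN}" xmlns:r="{DOC_REL}"><sheets>'
            '<sheet name="Summary" sheetId="1" r:id="rId1"/>'
            '<sheet name="Raw" sheetId="2" r:id="rId2"/>'
            "</sheets></workbook>"
        ),
        "xl/_rels/workbook.xml.rels": (
            f'<Relationships xmlns="{PKG_REL}">'
            f'<Relationship Id="rId1" Type="{DOC_REL}/worksheet" '
            'Target="worksheets/sheet1.xml"/>'
            f'<Relationship Id="rId2" Type="{DOC_REL}/worksheet" '
            'Target="worksheets/sheet2.xml"/>'
            "</Relationships>"
        ),
        "xl/worksheets/sheet1.xml": (
            f'<worksheet xmlns="{MAIN}"><dimension ref="A1:C10"/>'
            "<sheetData/></worksheet>"
        ),
        # No <dimension> here, to show the "extent unknown" case.
        "xl/worksheets/sheet2.xml": (
            f'<worksheet xmlns="{MAIN}"><sheetData/></worksheet>'
        ),
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, xml in parts.items():
            zf.writestr(name, xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for info in register_workbook(make_demo_workbook()):
        print(info)
```

Something along these lines, run over each workbook, would give per-worksheet names and (where declared) extents as an intermediate between no info and a full read of the cells.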

Q2: what do you think of a registration-based workflow?

Q3: what do you think of marshalling more data about worksheets at registration / workbook creation time? It creates an intermediate between practically no info and full reading of cells, etc.

Finally, it seems like one can return a linen::worksheet (rexcel_read_worksheet() does) and I wonder what that even means. Early on, the student who worked with me on googlesheets also allowed direct access to worksheets and this caused trouble. Technically, it was a problem because she implemented it in a way that ran up against some of XML's worst gotchas re: memory leakage. But conceptually it was also tricky. A worksheet can't exist outside a workbook, so you were always dragging around host workbook info anyway. So we implemented a policy where you either interacted with the object that comes from registering a sheet or with data coming out of the sheet. But there was no user-facing, tangible notion of anything in between. I know our situation is different (R6 class, local xlsx, etc.) but still ....

Q4: what's the deal with worksheet objects? This question is kinda vague. Sorry.

Let me know if we should just Skype for some/all of these.
