From my onion-peeling adventure, I gather that if I read a worksheet via rexcel_read(), I drop into rexcel_read_workbook() and then into rexcel_read_worksheet(). At least, those are the exported functions called. It feels like there's one more layer or one more function than necessary? Q1: Can you help me understand the role of rexcel_read()? I think it's the one whose purpose isn't clear.
In googlesheets, for better or worse, there's an explicit registration step, that creates an R object with metadata about a Google Sheet. Only with that in hand can you start reading stuff back out of it. With Google Sheets, this is practically a requirement vs. a voluntary design decision. But would a similar workflow make sense for rexcel?
I think I'm proposing that most of what's in rexcel_read_workbook() get moved into a workbook "registration" function. So that it's possible to get set up to read a workbook w/o actually diving down into any worksheets (currently not possible, I believe?).
I also think (correct me) that the current reading functions leave behind little to no info for worksheets that weren't specifically requested. Again, for a Google Sheet, when I register it, I create an overview of all worksheets (name and extent, mostly). When I think about us characterizing the Enron corpus, it would be nice to be able to register each workbook (15K) and get high-level info on the worksheets (80K) w/o necessarily reading their cells.
Q2: what do you think of a registration-based workflow?
Q3: what do you think of marshalling more data about worksheets at registration / workbook creation time? It creates an intermediate between practically no info and full reading of cells, etc.
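To make Q2 and Q3 a bit more concrete, here is a rough sketch of the kind of overview I have in mind. Everything in it is hypothetical: sheet_names(), sheet_extent(), and register_workbook() are made-up names, not rexcel or linen functions. It assumes the text of xl/workbook.xml and of each worksheet's XML has already been pulled out of the xlsx archive, and that the worksheet XML is supplied in the same order as the sheets are listed. It uses base R regular expressions instead of a real XML parser, so it does not decode escaped characters in sheet names, and it trusts the optional `<dimension>` element for the extent.

```r
# Hypothetical sketch: none of these functions exist in rexcel or linen.

# Sheet names from the text of xl/workbook.xml (supplied as one string).
sheet_names <- function(workbook_xml) {
  tags <- regmatches(workbook_xml, gregexpr("<sheet [^>]*>", workbook_xml))[[1]]
  sub('.* name="([^"]*)".*', "\\1", tags)
}

# Declared extent (e.g. "A1:D20") from the text of one worksheet's XML.
# The <dimension> element is optional, so return NA when it is absent.
sheet_extent <- function(sheet_xml) {
  m <- regmatches(sheet_xml, regexpr('<dimension ref="[^"]*"', sheet_xml))
  if (length(m) == 0) return(NA_character_)
  sub('<dimension ref="([^"]*)"', "\\1", m)
}

# "Register" a workbook: collect per-sheet metadata without parsing any cells.
# Assumes sheet_xmls is in the same order as the sheets in workbook_xml.
register_workbook <- function(workbook_xml, sheet_xmls) {
  data.frame(
    name = sheet_names(workbook_xml),
    extent = vapply(sheet_xmls, sheet_extent, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the XML inside a two-sheet workbook.
workbook_xml <- paste0(
  '<workbook><sheets>',
  '<sheet name="budget" sheetId="1" r:id="rId1"/>',
  '<sheet name="notes" sheetId="2" r:id="rId2"/>',
  '</sheets></workbook>'
)
sheet_xmls <- c(
  '<worksheet><dimension ref="A1:D20"/><sheetData/></worksheet>',
  '<worksheet><sheetData/></worksheet>'
)

overview <- register_workbook(workbook_xml, sheet_xmls)
print(overview)
```

The point is only that the result is one small row per worksheet, which sits between "practically no info" and "all the cells". A real version would unzip the file and resolve which worksheet file belongs to which sheet rather than relying on order.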
Finally, it seems like one can return a linen::worksheet (rexcel_read_worksheet() does) and I wonder what that even means. Early on, the student who worked with me on googlesheets also allowed direct access to worksheets and this caused trouble. Technically, it was a problem because she implemented it in a way that ran up against some of XML's worst gotchas re: memory leakage. But conceptually it was also tricky. A worksheet can't exist outside a workbook, so you were always dragging around host workbook info anyway. So we implemented a policy where you either interacted with the object that comes from registering a sheet or with data coming out of the sheet. But there was no user-facing tangible notion of anything in between. I know our situation is different (R6 class, local xlsx, etc.) but still ....
Q4: what's the deal with worksheet objects? This question is kinda vague. Sorry.
Let me know if we should just Skype for some/all of these.