Skip to content

Conversation

@unintellisense
Copy link

I recently started using the project to perform bulk updates to Salesforce, and found it was not creating multiple batches when the row count exceeded the maximum 10,000 rows allowed.

This PR sets a default batch size of 5000 (since there is also a max size per batch I preferred to not start at 10k) and allows a batchSize parameter to be set if you prefer to change the batch size (often, the SF API will process multiple batches faster than one large batch, though your API limit is measured by number of batches).

@mosche
Copy link

mosche commented Sep 10, 2020

Running into the same issue, this would be very helpful @springml :)

build.sbt Outdated
Comment on lines 14 to 15
"com.springml" % "salesforce-wave-api" % "1.0.10",
"com.github.loanpal-engineering" % "salesforce-wave-api" % "eb71436",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unintellisense Wondering, is this related to this change? Why do you need to change to a fork of the api client?

@mosche
Copy link

mosche commented Sep 10, 2020

Looking into it a bit further, it might be better to improve Utils.repartition to also consider the number of rows when calculating the target number of partitions rather than just basing the number of partitions on the estimated size.

Comment on lines +35 to +36
val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
val partitionedRDD = csvRDD.repartition(partitionCnt)
Copy link

@mosche mosche Sep 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be the right place to repartition as it's just leading to a 2nd round of shuffling the data around :/ Partitioning to control the size of ingest batches is already done in Utils.repartition, so the limit of records per batch should be considered there:
https://github.com/springml/spark-salesforce/pull/59/files#diff-b359f3e710dff2341dbedadb012b9ff4R62-R73

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A PR for the alternative approach is here #59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants