add default max batch size and batchSize parameter for SF updates #55
base: master
Changes from 8 commits
```diff
@@ -23,20 +23,24 @@ class SFObjectWriter (
   val mode: SaveMode,
   val upsert: Boolean,
   val externalIdFieldName: String,
-  val csvHeader: String
+  val csvHeader: String,
+  val batchSize: Integer
 ) extends Serializable {

   @transient val logger = Logger.getLogger(classOf[SFObjectWriter])

   def writeData(rdd: RDD[Row]): Boolean = {
     val csvRDD = rdd.map(row => row.toSeq.map(value => Utils.rowValue(value)).mkString(","))

+    val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
+    val partitionedRDD = csvRDD.repartition(partitionCnt)
+
     val jobInfo = new JobInfo(WaveAPIConstants.STR_CSV, sfObject, operation(mode, upsert))
     jobInfo.setExternalIdFieldName(externalIdFieldName)

     val jobId = bulkAPI.createJob(jobInfo).getId

-    csvRDD.mapPartitionsWithIndex {
+    partitionedRDD.mapPartitionsWithIndex {
       case (index, iterator) => {
         val records = iterator.toArray.mkString("\n")
         var batchInfoId : String = null
```

Comment on lines +44 to +45:

> This doesn't seem to be the right place to repartition, as it just leads to a second round of shuffling the data around :/ Partitioning to control the size of ingest batches is already done in

> A PR for the alternative approach is here: #59
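The partition count in the diff above rounds up with integer division so that each partition averages no more than `batchSize` records. A minimal sketch of that arithmetic (plain Python; the function name is chosen here for illustration, not part of the PR):

```python
def partition_count(record_count: int, batch_size: int) -> int:
    # Mirrors the Scala expression (1 + csvRDD.count() / batchSize).toInt:
    # floor-divide, then add one, so each partition averages at most
    # batch_size records.
    return 1 + record_count // batch_size

# 25,000 records with a batch size of 10,000 -> 3 partitions,
# averaging ~8,333 records each, safely under the batch limit.
assert partition_count(25_000, 10_000) == 3

# Note the extra partition when the count divides evenly: 20,000 records
# at a batch size of 10,000 yields 3 partitions rather than 2.
assert partition_count(20_000, 10_000) == 3
```

The `+ 1` guarantees the limit is never exceeded but allocates one more partition than strictly needed whenever the count is an exact multiple of the batch size.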
> @unintellisense Wondering, is this related to this change? Why do you need to change to a fork of the api client?