@harshavardhana
Member

@harshavardhana harshavardhana commented Sep 4, 2019

This change fixes multiple issues:

  • handles Unicode boundaries properly for special delimiters
  • handles zero-payload 'Cont' event messages
  • handles error messages properly
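
To illustrate the Unicode-boundary and zero-payload items above, here is a minimal sketch, not the code from this PR. It assumes the response arrives as raw byte chunks and that a multi-byte character such as '╦' can be split across two chunks. The helper name `decode_chunks` is hypothetical; it uses the standard library's incremental UTF-8 decoder so that a partial character is buffered until the rest arrives, and empty chunks are skipped.

```python
import codecs


def decode_chunks(chunks):
    """Yield text decoded from byte chunks without breaking multi-byte characters.

    Hypothetical helper for illustration only; not part of minio-py.
    """
    decoder = codecs.getincrementaldecoder('utf-8')()
    for chunk in chunks:
        # An incomplete trailing sequence is held back until the next chunk.
        text = decoder.decode(chunk)
        if text:  # skips empty (zero-payload) chunks and held-back partials
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail


raw = u'a\u2566b'.encode('utf-8')      # u'\u2566' is the box-drawing delimiter
chunks = [raw[:2], b'', raw[2:]]       # split inside the 3-byte character
decoded = u''.join(decode_chunks(chunks))
```

Decoding each chunk independently with `chunk.decode('utf-8')` would fail on the first chunk here, because it ends in the middle of the delimiter's byte sequence.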

@harshavardhana
Member Author

The PR is updated with further changes. @Praveenrajmani @sinhaashish PTAL

@sinhaashish
Contributor

sinhaashish commented Sep 8, 2019

This breaks with DELIMITER_CH = '╦':

data = client.select_object_content('wlk-data-wbrp', '20190612-00690-1/wlk-wbrp-part-0000.csv.gz', options)

  output_serialization=OutputSerialization(
        csv=CSVOutput(QuoteFields="ASNEEDED",
                      RecordDelimiter="\n",
                      FieldDelimiter=DELIMITER_CH,
                      QuoteCharacter='"',
                      QuoteEscapeCharacter='"',)
Traceback (most recent call last):
  File "examples/select_object_content.py", line 70, in <module>
    data = client.select_object_content('wlk-data-wbrp', '20190612-00690-1/wlk-wbrp-part-0000.csv.gz', options)
  File "build/bdist.linux-x86_64/egg/minio/api.py", line 255, in select_object_content
  File "build/bdist.linux-x86_64/egg/minio/xml_marshal.py", line 114, in xml_marshal_select
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

@harshavardhana
Member Author

harshavardhana commented Sep 8, 2019

In Python 2, Unicode characters must be passed as unicode literals, e.g. u'╦'; Python 3 strings are Unicode natively, so no prefix is needed there.
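
A small sketch of why the traceback above reports byte 0xe2 at position 0: the delimiter is a single code point (U+2566) whose UTF-8 encoding is three bytes, the first of which is 0xe2. In Python 2, a plain '╦' literal is that byte string, and calling `.encode()` on it triggers an implicit ASCII decode first, which fails on the first byte. The snippet below reproduces the ASCII decode failure explicitly so it runs on both Python 2 and Python 3; the names `DELIMITER` and `decodes_as_ascii` are illustrative only.

```python
# -*- coding: utf-8 -*-
DELIMITER = u'\u2566'  # the box-drawing delimiter as a unicode literal

# One character, but three bytes once encoded as UTF-8.
utf8_bytes = DELIMITER.encode('utf-8')


def decodes_as_ascii(raw):
    """Return True if the byte string contains only ASCII bytes."""
    try:
        raw.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False


ascii_ok = decodes_as_ascii(utf8_bytes)
print(len(DELIMITER), len(utf8_bytes), ascii_ok)
```

Passing the delimiter as u'╦' keeps it as text all the way to the XML serializer, so no implicit ASCII decode is attempted.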

@harshavardhana
Member Author

from minio import Minio
from minio.error import ResponseError

from minio.select.options import (SelectObjectOptions, CSVInput,
                                  JSONInput, RequestProgress,
                                  ParquetInput, InputSerialization,
                                  OutputSerialization, CSVOutput,
                                  JsonOutput)
from minio.select.errors import (SelectCRCValidationError, SelectMessageError)

client = Minio('s3.amazonaws.com',
               access_key='ACCESSKEY',
               secret_key='SECRETKEY')

options = SelectObjectOptions(
    expression="select * from s3object",
    input_serialization=InputSerialization(
        compression_type="GZIP",
        csv=CSVInput(FileHeaderInfo="USE",
                     RecordDelimiter="\n",
                     FieldDelimiter=u'╦',
                     QuoteCharacter='"',
                     QuoteEscapeCharacter='"',
                     Comments="#",
                     AllowQuotedRecordDelimiter="FALSE",
                     ),
        # If input is JSON
        # json=JSONInput(Type="DOCUMENT",)
        ),

    output_serialization=OutputSerialization(
        csv=CSVOutput(QuoteFields="ASNEEDED",
                      RecordDelimiter="\n",
                      FieldDelimiter=u'╦',
                      QuoteCharacter='"',
                      QuoteEscapeCharacter='"',)

        # json = JsonOutput(
        #     RecordDelimiter="\n",
        #     )
        ),
    request_progress=RequestProgress(
        enabled="False"
        )
    )

try:
    data = client.select_object_content('wlk-data-wbrp', '20190612-00690-1/wlk-wbrp-part-0000.csv.gz', options)

    # Get the records
    with open('my-record-file', 'w') as record_data:
        for d in data.stream(10*1024):
            record_data.write(d)

    # Get the stats
    print(data.stats())

except SelectMessageError as err:
    print(err)

except SelectCRCValidationError as err:
    print(err)

except ResponseError as err:
    print(err)

sinhaashish
sinhaashish previously approved these changes Sep 9, 2019
Contributor

@sinhaashish sinhaashish left a comment

Tested with different inputs and LGTM.
Just one fix: SelectSelectCRCValidationError -> SelectCRCValidationError in examples/select_object_content.py

Praveenrajmani
Praveenrajmani previously approved these changes Sep 9, 2019
Collaborator

@Praveenrajmani Praveenrajmani left a comment

LGTM

@nitisht nitisht merged commit 0625257 into minio:master Sep 10, 2019
@harshavardhana harshavardhana deleted the fix-obj branch September 10, 2019 15:28