This repository was archived by the owner on Jun 4, 2026. It is now read-only.

Sync with Cloudera Impyla#2

Draft
thyarles wants to merge 68 commits into smartlab-br:master from cloudera:master


@thyarles
Member

No description provided.

jasonmfehr and others added 16 commits June 7, 2023 21:31
…515)

that are applicable to username/password auth and JWT auth are not
mixed together on the same call to the connect method.

These additional checks prevent confusion about which authentication
method is actually used for the connection.

New tests were added to cover the new checks.
Impyla gets cookies from an HTTPMessage object formed from a
response to an HTTP message. The format of cookies in the message
differs across Python versions. In Python 2 the HTTPMessage is a
mimetools.Message object, and the Set-Cookie values all appear in a
single header, separated by newlines. In Python 3 the HTTPMessage is an
email.message.Message, and the Set-Cookie values appear as duplicate
headers.

Add platform dependent code to get_all_matching_cookies() that loads
cookies from all the Set-Cookie headers.

TESTING:
    Changed test_get_all_matching_cookies() to build the HTTPMessage
    using a new utility method that creates Set-Cookie headers in
    the appropriate format for the platform.

    I hand tested with a proxy that inserted 3 cookies into http
    responses. I added the 3 cookie names to the list of default
    cookies. I ran TestHttpConnect.test_simple_connect() connecting
    to Impala through the proxy and verified with the debugger that
    the cookies were returned correctly from
    get_all_matching_cookies() in both python2 and python3.
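The version difference described in this commit can be illustrated with a minimal sketch. This is not impyla's get_all_matching_cookies(); the helper name `collect_set_cookie_values` is hypothetical, and the Python 2 branch simply follows the description above (one Set-Cookie header whose values are separated by newlines).

```python
import sys
from email.message import Message


def collect_set_cookie_values(headers):
    """Return every Set-Cookie value carried by a response header object."""
    if sys.version_info[0] >= 3:
        # Python 3: email.message.Message keeps duplicate headers separately,
        # so get_all() returns one entry per Set-Cookie header.
        return headers.get_all("Set-Cookie") or []
    # Python 2 (as described above): a single header, values separated by newlines.
    combined = headers.get("Set-Cookie") or ""
    return [value.strip() for value in combined.split("\n") if value.strip()]


# Build a Python 3 style message with duplicate Set-Cookie headers.
# Assigning to the same header name twice appends rather than replaces.
msg = Message()
msg["Set-Cookie"] = "a=1; Path=/"
msg["Set-Cookie"] = "b=2; Path=/"
cookie_values = collect_set_cookie_values(msg)
```

Each returned value could then be parsed with `http.cookies.SimpleCookie` if the individual cookie names are needed.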
Co-authored-by: cravani <cravani@cloudera.com>
Current Usage part works well for Impala users but will fail for Hive users because of the `auth_mechanism` default value.
This adds a comment targeted towards Hive users so they can quick start too.
ImpalaService.thrift is updated to contain CloseImpalaOperation,
which can be used to get the number of modified rows in DMLs.
This is not just a copy, some parts of ImpalaService.thrift are
not included to avoid pulling in more Thrift files as dependencies.

Also updated process_thrift.sh to work with current Impala env vars.
sqlalchemy 2 (now default on pip in Python 3) removed some
functions used in tests. Updated these to work both with
sqlalchemy 2.* and 1.* (>=1.2).
* Support Cursor.rowcount and close finished queries

With current Impala server rowcount support needs DMLs to be
closed with CloseImpalaOperation() as there is no simpler way
to get the number of modified rows.
See https://issues.apache.org/jira/browse/IMPALA-12647 for
alternatives.

This change adds option close_finished_queries for cursors
with default True. Setting it to False brings back the old
behavior.

If queries are closed after they finish, calling the get_log
RPC is no longer possible. If close_finished_queries is true
then the logs are fetched and stored before closing the query
to be able to return the saved results with get_log. Generally
get_log shouldn't be too expensive an RPC.

Another potential side-effect is that get_profile may fail as
Impala can discard the runtime profile after the query is
closed (see Impala flag query_log_size).

Despite the above side effects closing the queries seems a better
default behavior as it helps avoid queries hanging in the
"waiting to be closed" state and provides reliable rowcount. This
is also consistent with the way impala-shell works.

Testing:
- rowcount already had good coverage in DBAPI2 compliance tests
  (e.g. test_mixedfetch)
- new tests were added for some missing rowcount cases and for
  getting warning/error log for closed queries

* Fix review comments
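The ordering described in the rowcount commit (fetch the log, then close the query, then serve later get_log calls from the saved copy) can be sketched with stand-in classes. Everything here is hypothetical: `FakeOperation` and `SketchCursor` are not impyla classes, and the only names taken from the commit message are `close_finished_queries`, `rowcount`, and `get_log`.

```python
class FakeOperation:
    """Hypothetical stand-in for a server-side query handle."""

    def __init__(self, log, modified_rows):
        self._log = log
        self._modified_rows = modified_rows
        self.closed = False

    def get_log(self):
        if self.closed:
            raise RuntimeError("operation already closed")
        return self._log

    def close(self):
        # Assumption taken from the commit message: closing the operation
        # is what reports the number of modified rows for DMLs.
        self.closed = True
        return self._modified_rows


class SketchCursor:
    """Hypothetical cursor showing the save-log-then-close ordering."""

    def __init__(self, close_finished_queries=True):
        self.close_finished_queries = close_finished_queries
        self.rowcount = -1
        self._op = None
        self._saved_log = None

    def finish(self, op):
        """Call once the query has finished executing."""
        self._op = op
        if self.close_finished_queries:
            self._saved_log = op.get_log()  # must happen before close()
            self.rowcount = op.close()

    def get_log(self):
        if self._saved_log is not None:
            return self._saved_log
        return self._op.get_log()
```

With `close_finished_queries=False` the sketch leaves the operation open, so `rowcount` stays at -1 and `get_log` goes to the live operation, mirroring the "old behavior" mentioned above.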
The old version used deprecated functions that were
removed in Python 3.12.

The change only contains code generated by:
versioneer install
Python 3.12 removed the deprecated cert_file and key_file
arguments from http_client.HTTPSConnection. These should
always be empty in Impyla as the server is never verified
in https connections (see #362).
* Add tox.ini to help testing with multiple python versions

* Revert "Update versioneer to 0.29 (needed for Python 3.12) (#532)"

This reverts commit b98ffef.
Co-authored-by: David Hulsman <david.hulsman@tennet.eu>
@thyarles thyarles requested a review from danielamguerra March 25, 2024 21:51
Member Author

@thyarles thyarles left a comment

LGTM.

csringhofer and others added 3 commits April 15, 2024 17:41
This function is called for every query during normal execution, making
this info level too verbose.
* Add text() wrapper for metadata queries.

Remove tablename from retrieve columnname results.

* Update sqlalchemy.py

remove tablename from get_columns result.

* replace 'r' in re.sub argument
@thyarles thyarles marked this pull request as draft April 16, 2024 16:10
@thyarles thyarles added the wontfix This will not be worked on label Apr 16, 2024
WWakker and others added 7 commits August 7, 2024 19:50
Adds wildcard ('*') support to the `http_cookie_names` connect property
to preserve all cookies returned by the server. Preserves prior behavior
for any other value of `http_cookie_names`.
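A minimal sketch of the wildcard rule follows. The function `select_cookies` is hypothetical and is not impyla code; it also assumes that `http_cookie_names` may be either a list of names or a comma-separated string, which the commit message does not state.

```python
def select_cookies(received, http_cookie_names):
    """Keep the cookies allowed by http_cookie_names; '*' keeps all of them."""
    if http_cookie_names == "*":
        return dict(received)
    if isinstance(http_cookie_names, str):
        # Assumed format: comma-separated cookie names.
        wanted = {n.strip() for n in http_cookie_names.split(",") if n.strip()}
    else:
        wanted = set(http_cookie_names or [])
    return {name: value for name, value in received.items() if name in wanted}
```

For example, `select_cookies({"a": "1", "b": "2"}, "*")` keeps both cookies, while passing `["a"]` keeps only `a`.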
* Avoid retrying non-idempotent RPCs in binary connections (#549)

See #549 for the detailed analysis of the issue.

The fix works similarly to the existing solution for http connections:
- each RPC knows whether it is idempotent
- if the error comes from establishing the connection, then retry
- if the error comes from executing the RPC, only retry if the RPC
  is idempotent

A test is added that relies on slow metadata handling in the
Impala cluster to trigger timeouts. It would be nice to add wider
and more reliable tests in the future, similar to the http tests
in test_hs2_fault_injection.py

* Fix review comments

* Fix review comment
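The retry rule listed in this commit for binary connections can be sketched as follows. This is an illustration only, not the impyla implementation: `ConnectError`, `RpcError`, and `call_with_retry` are hypothetical names, and the split between "connection" and "execution" errors is modelled with two exception types.

```python
class ConnectError(Exception):
    """Hypothetical: raised when establishing the connection fails."""


class RpcError(Exception):
    """Hypothetical: raised when the RPC fails on an open connection."""


def call_with_retry(open_conn, rpc, idempotent, retries=3):
    """Retry connection failures always, RPC failures only if idempotent."""
    last_error = None
    for _ in range(max(1, retries)):
        try:
            conn = open_conn()
        except ConnectError as err:
            # Nothing reached the server, so retrying is always safe.
            last_error = err
            continue
        try:
            return rpc(conn)
        except RpcError as err:
            if not idempotent:
                # The server may already have acted on the request.
                raise
            last_error = err
    raise last_error
```

The key design point is that a non-idempotent RPC is attempted at most once after a connection is open, because the client cannot tell whether the server already executed it.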
The goal is to support "long poll" (IMPALA-13294). When query option
long_polling_time_ms is set, the impala server will wait in
GetOperationStatus for this time (or until the query status changes).
This allows detecting earlier that a query has finished without making
GetOperationStatus RPCs more frequent.

If long_polling_time_ms is not used then the effect should be minor -
GetOperationStatus is a quick RPC so the time it takes should mainly
come from network delay.

_get_sleep_interval() is not changed (min 0.01s, max 1s) to avoid
regression in existing use cases. It could be useful to override
this in a later patch based on the value of long_polling_time_ms.
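A rough sketch of a status polling loop with a clamped sleep interval is shown below. Only the 0.01s and 1s bounds come from the commit message; the growth schedule, the status strings, and the names `sleep_interval` and `wait_to_finish` are assumptions made for illustration and do not reproduce `_get_sleep_interval()`.

```python
import time

MIN_SLEEP = 0.01  # seconds, lower bound mentioned in the commit message
MAX_SLEEP = 1.0   # seconds, upper bound mentioned in the commit message


def sleep_interval(elapsed):
    """Hypothetical schedule: grows with elapsed time, clamped to the bounds."""
    return max(MIN_SLEEP, min(MAX_SLEEP, elapsed / 10.0))


def wait_to_finish(get_status, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports something other than RUNNING."""
    start = clock()
    while True:
        # With long_polling_time_ms set, this call may itself block on the
        # server until the status changes or the long-poll time elapses.
        status = get_status()
        if status != "RUNNING":
            return status
        sleep(sleep_interval(clock() - start))
```

Because the server-side wait happens inside the status call, the client-side sleep can stay unchanged, which is the compatibility argument the commit makes.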
Supported Python versions are also updated in setup.py.
The issue was introduced in #542.
Caught by Impala's LdapImpylaHttpTest.
csringhofer and others added 30 commits March 6, 2025 21:13
This patch updates impala/thrift/ to follow the latest definitions from
Impala 4.5.0.

Did some hand editing in ImpalaService.thrift to exclude thrift files that
are unrelated to query profile such as Frontend.thrift,
BackendGflags.thrift, and Query.thrift. Updated DEVELOP.md and
process_thrift.sh to mention this issue. Use IMPALA_THRIFT_PY_VERSION
instead of IMPALA_THRIFT_CPP_VERSION. Both point to thrift-0.16.0.

Testing:
- Update test_get_log to validate attributes existence.
- Run test with following command:
  tox -- -ktest_get_log
This avoids a warning in sqlalchemy. It is not clear to me whether
it could be enabled, it would need a deeper dive into sqlalchemy.
Other dialects I checked also set it to False.
Python 2.7 needs bitarray < 3, but on Python 3 bitarray 3.* can be used.
The bitarray functionality used in impyla is minimal and didn't change
with the major version bump.
This fixes the following warning:
SADeprecationWarning: The dbapi() classmethod on dialect classes has been renamed to import_dbapi().

The issue was mentioned in #214
* Update build_summary_table function

This patch updates build_summary_table to match the same function in
impala-shell.
https://github.com/apache/impala/blob/a07bf84/shell/impala_client.py#L113

Testing:
Run and pass following command
```
tox -- -ktest_build_summary_table
```

* Copy exec_summary.py from apache/impala@e73e2d4

* Remove one more cur.close_operation()
Winkerberos is an alternative to kerberos package on Windows and
has the same api, so can be used as a drop in replacement.

Based on another PR:
#504

The difference is that the current patch prefers kerberos if both are
available to avoid breaking existing workflows. setup.py is
also not modified. To use winkerberos it has to be installed
independently from impyla and impyla should be installed without
[kerberos] extra.
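The "prefer kerberos, fall back to winkerberos" behaviour can be sketched with a generic import helper. `load_first_available` is a hypothetical name and this is not the code from the patch; it only shows the preference order the commit message describes.

```python
import importlib


def load_first_available(module_names):
    """Return the first module in module_names that can be imported, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# kerberos comes first, so it wins whenever both packages are installed.
kerberos_module = load_first_available(["kerberos", "winkerberos"])
```

If neither package is installed, `kerberos_module` is `None` and a caller would need to report that Kerberos authentication is unavailable.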
test_build_summary_table failed based on what other tests ran
in the same session due to using the same session level cursor
which could have altered query options from other tests.
The patch switches to function level cursor to isolate tests and
creates new fixture session_cur to use in session/module level
fixtures that need a cursor.

Also fixes test_pandas_dataframe_to_sql on Python 3.7 by changing
tox to use 1.* sqlalchemy
Also removed Python 3.6 (EOL since 2021) from the supported versions.
It is still likely to work but testing it is a nuisance.
The validation tried to install impyla in an environment with
very old setuptools (18.0.1) that didn't support the environment
markers (e.g 'bitarray<3; python_version < "3"') added in #588.
Installing fresh setuptools didn't work with easy_install
so moved to using pip in the py2.7 env too.
Simply removed the checks against passing ca_cert when
using hs2-http.

Testing:
- added test_https_connection_with_cert
- added ssl/https tests for rejected certificates (with
  Impala's incorrect-commonname-cert.pem)
- tested with Python3.9 - 3.12
Add verify_cert arg to connect() to verify the server when use_ssl=True
but ca_cert is not set. In this case if verify_cert=True then Python will
validate the server using default CA certs.

For details of default CA handling in Python see:
https://docs.python.org/3/library/ssl.html

Testing:
- added negative tests for connecting to Impala with self signed cert
- no positive tests - these would need a proper certificate or
  forcing Python to accept a self-signed certificate
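How the two arguments could map onto a Python `ssl.SSLContext` is sketched below. The helper `build_ssl_context` is hypothetical and is not taken from impyla; only the argument names `ca_cert` and `verify_cert` come from the commit messages, and the unverified branch reflects the earlier note that the server was historically not verified.

```python
import ssl


def build_ssl_context(ca_cert=None, verify_cert=False):
    """Hypothetical mapping of ca_cert / verify_cert onto an SSLContext."""
    if ca_cert is not None:
        # Verify the server against the given CA bundle.
        return ssl.create_default_context(cafile=ca_cert)
    if verify_cert:
        # Verify the server against Python's default CA certificates.
        return ssl.create_default_context()
    # No verification: check_hostname must be disabled before
    # verify_mode can be set to CERT_NONE.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

A context built with `verify_cert=True` would reject a self-signed server certificate, which matches the negative tests listed above.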
* resilient columnar batch handling in hiveserver2.py

fixes #596

* add test_has_more_rows_fetchcolumnar

* fix fetchcolumnar assertion
If FetchResults is called while the query is still executing, the
TFetchResultsResp will not contain a TRowSet. This change allows the
client to handle that case without throwing an exception.

Fixes #594.
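The missing-TRowSet case can be sketched with simple stand-in objects. `rows_from_response` is a hypothetical helper, not the function changed in the patch, and the attribute names `results` and `rows` are assumptions about the response layout.

```python
from types import SimpleNamespace


def rows_from_response(resp):
    """Return the rows in a fetch response, or [] when no row set is attached."""
    results = getattr(resp, "results", None)
    if results is None:
        # Query still executing: nothing to return yet, but not an error.
        return []
    return list(getattr(results, "rows", None) or [])


# Stand-ins for a response without and with a row set.
still_running = SimpleNamespace(results=None)
finished = SimpleNamespace(results=SimpleNamespace(rows=[(1,), (2,)]))
```

Returning an empty list lets the caller keep polling instead of failing on an attribute access against `None`.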
With this change impyla tests pass from Python 3.8 to 3.14.
With 3.7 still most tests pass, with the exception of
test_pandas_dataframe_to_sql which fails due to older pandas
version.

Dependencies are unpinned in tox.ini - this means that new versions
of libraries may break tests without any change in impyla, allowing
catching such issues earlier. pytest is also unpinned, leading to
using 8.* on Python 3.8/3.9 and 9.* on newer Pythons.
Also removes long deprecated ez_setup.py
Also fixes a comment in impala/__init__.py.
impyla uses pure-sasl directly but it was not mentioned
in setup.py. Installs still worked due to indirectly
depending on it through thrift_sasl. Added the same
dependency as the one in thrift_sasl.
THRIFT-792 added inner exception to TSocket.open(). This revealed
that test_ssl_connection_default_certs failed for a different reason
than I assumed - instead of failing to verify, it refuses to verify
when ca_cert=None. To actually verify with default certs, TSSLSocket's
constructor needs to be called with an ssl context.
Note that having positive tests would also have caught this earlier.

Thrift version is constrained to >= 0.23 instead of pinning it.
Generated py files (created with Thrift 0.16) are not updated.

Also changed ImpalaConnectionTests's _execute_queries to do simple
SELECT queries instead of CREATE+DROP table as some tests were
flaky.