
process: make mypy happy with type annotations #1036

Merged

merged 23 commits into googleapis:main on Nov 8, 2021
Conversation

plamut
Contributor

@plamut plamut commented Oct 28, 2021

Closes #987.

PR checklist:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@plamut plamut requested review from tswast and a team October 28, 2021 16:29
@plamut plamut requested a review from a team as a code owner October 28, 2021 16:29
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Oct 28, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 28, 2021
@plamut plamut changed the title process: add mypy types check process: make mypy happy with type annotations Oct 28, 2021
@plamut
Contributor Author

plamut commented Oct 28, 2021

I tried to keep the `type: ignore` directives to a minimum, but some cases were just too complex for mypy or were clear false positives.
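When an ignore really is unavoidable, it can be scoped to a single mypy error code so it does not mask unrelated problems later. A minimal sketch (the function and the flagged line are hypothetical; the bracketed error-code syntax is standard mypy):

```python
from typing import Any


def parse_count(raw: Any) -> int:
    # Suppose mypy flags this call with a false positive; scoping the
    # ignore to one error code keeps any other errors on the line visible.
    return int(raw)  # type: ignore[arg-type]  # hypothetical false positive


print(parse_count("42"))
```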

google/cloud/bigquery/external_config.py (resolved review thread)
google/cloud/bigquery/job/base.py (resolved review thread)
google/cloud/bigquery/job/query.py (resolved review thread)
@@ -144,7 +144,7 @@ def from_api_repr(cls, stats: Dict[str, str]) -> "DmlStats":

     args = (
         int(stats.get(api_field, default_val))
-        for api_field, default_val in zip(api_fields, cls.__new__.__defaults__)
+        for api_field, default_val in zip(api_fields, cls.__new__.__defaults__)  # type: ignore
Contributor

I think it is barfing because the Dict[str, str] above mismatches the int of the __defaults__. Does the backend really return the counts as strings? That would seem odd, unless they are expecting to be returning > 2^64 values.

Contributor Author

@plamut plamut Oct 29, 2021

Actually, the error here was Callable[...] does not have the attribute __defaults__, or something like that. I actually expected mypy to recognize that cls is a namedtuple and its __new__() has the attribute __defaults__, but it seems that it did not for some reason.

Maybe if I explicitly annotate cls, hm...

Update: Nope, sadly that did not work.
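For context, the runtime attribute the generator relies on does exist on namedtuples; mypy just cannot see it. A minimal sketch with a hypothetical stand-in for `DmlStats`:

```python
from typing import NamedTuple


class Stats(NamedTuple):
    # Hypothetical stand-in for DmlStats, with integer defaults.
    inserted_row_count: int = 0
    deleted_row_count: int = 0
    updated_row_count: int = 0


# At runtime the defaults live on __new__.__defaults__, which is what
# from_api_repr() zips against the API field names. mypy, however, types
# Stats.__new__ as a plain Callable, and Callable has no __defaults__
# attribute -- hence the `# type: ignore`.
print(Stats.__new__.__defaults__)  # (0, 0, 0)
```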

@@ -130,7 +130,7 @@ def _view_use_legacy_sql_getter(table):

 class _TableBase:
     """Base class for Table-related classes with common functionality."""

-    _PROPERTY_TO_API_FIELD = {
+    _PROPERTY_TO_API_FIELD: Dict[str, Union[str, List[str]]] = {
Contributor

Ugh, the mix-and-match of strings / lists of strings here has always given me heartburn. I'm not surprised mypy bitched about it, and frankly, giving it a union type doesn't make the pattern any better.
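For illustration, the union annotation covers the two value shapes the mapping actually uses; a minimal sketch with hypothetical entries (the real mapping lives in google/cloud/bigquery/table.py):

```python
from typing import Dict, List, Union

# Hypothetical entries: most properties map to a single camelCase API
# field (str), but nested properties map to a path given as a list of keys.
_PROPERTY_TO_API_FIELD: Dict[str, Union[str, List[str]]] = {
    "friendly_name": "friendlyName",
    "encryption_configuration": ["encryptionConfiguration", "kmsKeyName"],
}


def api_field(prop: str) -> Union[str, List[str]]:
    # Callers still have to handle both shapes, which is the pattern the
    # comment above objects to; the annotation only makes it explicit.
    return _PROPERTY_TO_API_FIELD[prop]


print(api_field("friendly_name"))
```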

google/cloud/bigquery/table.py (resolved review thread)
google/cloud/bigquery/client.py (resolved review thread)
@@ -526,12 +538,12 @@ def _ensure_bqstorage_client(

         bqstorage_client = bigquery_storage.BigQueryReadClient(
             credentials=self._credentials,
             client_options=client_options,
-            client_info=client_info,
+            client_info=client_info,  # type: ignore  # (None is also accepted)
Contributor

Do we need to update google-cloud-bigquery-storage / the generated clients to allow None for client_info?

Contributor Author

It appears so, yes. The docstring says that None is allowed, but mypy complains that Optional[...ClientInfo] is not acceptable here as an argument.

It's probably because the default value of client_info is not a literal None, but DEFAULT_CLIENT_INFO instead.
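The mismatch can be reproduced without the generated client; a minimal sketch (`ClientInfo` here is a hypothetical stand-in, not the real google-api-core type):

```python
from typing import Optional


class ClientInfo:  # hypothetical stand-in for the generated client's type
    pass


DEFAULT_CLIENT_INFO = ClientInfo()


# The generated signature uses the non-Optional type because its default is
# DEFAULT_CLIENT_INFO rather than a literal None:
def make_read_client(client_info: ClientInfo = DEFAULT_CLIENT_INFO) -> ClientInfo:
    return client_info


maybe_info: Optional[ClientInfo] = None
# mypy flags this argument: Optional[ClientInfo] is not compatible with the
# declared ClientInfo -- even though the runtime accepts None just fine.
info = make_read_client(maybe_info)  # type: ignore
print(info)
```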

google/cloud/bigquery/client.py (resolved review thread)
@@ -1934,7 +1944,7 @@ def create_job(

         return self.load_table_from_uri(
             source_uris,
             destination,
-            job_config=load_job_config,
+            job_config=typing.cast(LoadJobConfig, load_job_config),
Contributor

I'm a bit surprised the cast is needed here. Would specifying the type of load_job_config at assignment time do the same thing?

Contributor

Likewise for the other config objects.

Contributor Author

The problem is that from_api_repr() is defined on the base _JobConfig class and its return type is annotated as _JobConfig. Which makes sense on the base class, but in practice the actual cls subclass is chosen here dynamically by introspecting job_config, and mypy cannot understand that.

Specifying the type of load_job_config at assignment time would not help, because again a base type (_JobConfig) is assigned to load_job_config with a specific sub-type, and mypy would still complain. Cast solves this.

The alternative could be changing the return type on the base class to include all sub-classes, but that doesn't feel right. It can get out of sync if a new sub-type is created, and a circular import would be needed. Besides, mypy would probably still complain if `Union[All, Config, Subtypes]` is assigned to a specific sub-class type.
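A minimal sketch of the shape of the problem (the dispatch logic here is simplified and hypothetical, not the real client's):

```python
import typing


class _JobConfig:
    @classmethod
    def from_api_repr(cls, resource: dict) -> "_JobConfig":
        # The concrete subclass is chosen dynamically at runtime by
        # inspecting the resource; mypy only sees the base return type.
        subcls = {"load": LoadJobConfig}[resource["jobType"]]
        config = subcls()
        config._properties = resource
        return config


class LoadJobConfig(_JobConfig):
    def __init__(self) -> None:
        self._properties: dict = {}


def load_table_from_uri(job_config: LoadJobConfig) -> str:
    return type(job_config).__name__


config = _JobConfig.from_api_repr({"jobType": "load"})
# mypy infers `_JobConfig` for `config`; the cast asserts the narrower type
# (with no runtime check) so the call below type-checks.
print(load_table_from_uri(typing.cast(LoadJobConfig, config)))
```

Annotating `config` at assignment time would not help here, since the declared return type of `from_api_repr()` is still the base class.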

@@ -2064,15 +2077,19 @@ def get_job(

             timeout=timeout,
         )

-        return self.job_from_resource(resource)
+        job_instance = self.job_from_resource(resource)  # never an UnknownJob
Contributor

Are we sure about that?

I recall that UnknownJob was introduced in list_jobs because someone might have permission to know that a job happened but not fetch the details about it.

I'd be interested in a little experiment to see what happens when we try to call get_job on one of those unknown jobs.

Contributor Author

Hm... I was under the impression that job_from_resource() is a helper that only returns UnknownJob if somebody manually provides an "invalid"/incomplete resource dict.

However, if the backend can also return such resource (e.g. due to permissions), the assumption in the comment does not hold (although the return type in the docstring already excludes UnknownJob).

Do you maybe remember the combination of permissions that result in an UnknownJob?
(also, I might need assistance here, because my access to cloud console is now severely limited - I'll have to check)

Contributor

> Do you maybe remember the combination of permissions that result in an UnknownJob?
> (also, I might need assistance here, because my access to cloud console is now severely limited - I'll have to check)

It was related to creating a job with one account and trying to list all jobs for the project with another account.

See jobs.list versus jobs.listAll here: https://cloud.google.com/bigquery/docs/access-control

| Permission | Description |
| --- | --- |
| bigquery.jobs.list | List all jobs and retrieve metadata on any job submitted by any user.¹ For jobs submitted by other users, details and metadata are redacted. |
| bigquery.jobs.listAll | List all jobs and retrieve metadata on any job submitted by any user.¹ |

¹ For any job you create, you automatically have the equivalent of the bigquery.jobs.get and bigquery.jobs.update permissions for that job.

Contributor Author

@tswast My access to Cloud Console is limited now and I cannot create an account anymore that only has bigquery.jobs.list permission (and possibly a few others to access jobs on the project), while the current development account that I use has the bigquery.jobs.listAll permission.

In other words, I don't think I can actually retrieve an UnknownJob.

Could you run the below script and see what happens if we call get_job() on an UnknownJob?

unknown_jobs.py

```python
from collections import Counter

from google.cloud import bigquery
from google.oauth2 import service_account


PROJECT_ID = "TODO: set"
DATASET_ID = "TODO: set"
TABLE_ID = "TODO: set"

MAIN_CRED_FILE = "TODO: /path/to/service_account.json"

# NOTE: This one must have the "bigquery.jobs.list" permission, but NOT
# "bigquery.jobs.listAll". We want metadata of jobs submitted by other
# users redacted.
OTHER_CRED_FILE = "TODO: /path/to/other/service_account.json"


def main():
    credentials = service_account.Credentials.from_service_account_file(MAIN_CRED_FILE)
    client = bigquery.Client(credentials=credentials)

    # Create table if needed.
    table_name_full = f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}"
    table = client.create_table(table_name_full, exists_ok=True)

    # Create at least one job (a load job in this case).
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("user_id", "STRING"),
            bigquery.SchemaField("name", "STRING"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    row = {"user_id": "1234567890", "name": "John"}
    load_job = client.load_table_from_json([row], destination=table, job_config=job_config)
    result = load_job.result()
    print(result)

    print("=" * 20)

    # Using another client, list other users' jobs. This client should NOT
    # have the "bigquery.jobs.listAll" permission.
    credentials = service_account.Credentials.from_service_account_file(OTHER_CRED_FILE)
    other_client = bigquery.Client(credentials=credentials)

    jobs_iter = other_client.list_jobs(
        project=PROJECT_ID, max_results=10_000, all_users=True
    )
    retrieved_jobs = list(jobs_iter)

    jobs_count = Counter(type(job) for job in retrieved_jobs)
    print(jobs_count)

    try:
        unknown_job = next(
            job for job in retrieved_jobs if isinstance(job, bigquery.UnknownJob)
        )
    except StopIteration:
        print("No UnknownJobs received, cannot test get_job().")
        return

    unknown_job = other_client.get_job(unknown_job)
    print(unknown_job._properties)


if __name__ == "__main__":
    main()
```

Contributor Author

FWIW, I preemptively changed the return type annotation to also include UnknownJob.

google/cloud/bigquery/client.py (resolved review thread)
noxfile.py (review thread)
@plamut
Contributor Author

plamut commented Oct 31, 2021

A subtlety here is that type checks will only pass if google-api-core>=2.1.1 is installed, the release in which type check support was added.

As time passes this will become a non-issue, but in the transitional period people might see failures if they have an older google-api-core installed. Those users would likely have to exclude google-cloud-bigquery when type-checking their projects.
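Affected users could exclude the package from their own type checks in the meantime; a hypothetical mypy.ini override (`ignore_errors` is a standard mypy per-module option):

```ini
# Hypothetical workaround for projects pinned to an older google-api-core:
# suppress errors reported from within the bigquery package itself.
[mypy-google.cloud.bigquery.*]
ignore_errors = True
```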

The OpenTelemetry API has changed since the minimum version the
BigQuery client currently uses, so we need to bound the maximum
OpenTelemetry version.

In addition, that maximum version does not yet support type checks, so
it is ignored.
setup.py (resolved review thread)
@plamut plamut added the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 31, 2021
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 31, 2021
@plamut plamut requested a review from a team as a code owner October 31, 2021 10:09
@plamut plamut requested a review from parthea October 31, 2021 10:09
@plamut plamut added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 31, 2021
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 31, 2021
@plamut
Contributor Author

plamut commented Oct 31, 2021

It appears some more adjustments are needed to the opentelemetry tests (or code).

Update: The opentelemetry tests fail on Python 3.7+ because we pin the dependency to 0.11b0 only on Python 3.6. Other Python sessions use a more recent version, and since 0.12b0 OpenTelemetry prevents overriding the TracerProvider (it can only be set once).

This restriction affects our fixtures, causing the opentelemetry tests to fail when run as part of the entire test suite. A short-term workaround could be pinning opentelemetry to 0.11b0 and dealing with it in #1038.

@tswast
Contributor

tswast commented Nov 8, 2021

🎉

```
nox > Running session mypy
nox > Creating virtual environment (virtualenv) using python3.8 in .nox/mypy
nox > python -m pip install -e .[all]
nox > python -m pip install ipython
nox > python -m pip install mypy==0.910
nox > python -m pip install types-protobuf types-python-dateutil types-requests types-setuptools
nox > mypy google/cloud
Success: no issues found in 49 source files
nox > Session mypy was successful.
```

@tswast tswast merged commit 66b3dd9 into googleapis:main Nov 8, 2021
@plamut plamut deleted the iss-987 branch November 9, 2021 09:19
abdelmegahedgoogle pushed a commit to abdelmegahedgoogle/python-bigquery that referenced this pull request Apr 17, 2023
* process: add mypy types check to nox sessions

* Ignore type errors for unannotated modules

Several dependencies lack type annotations, or they don't advertise
themselves as type-annotated. We do not want `mypy` to complain about
these.

* Fix mypy complaints (batch 1)

* Fix mypy complaints (batch 2)

* Fix mypy complaints (batch 3)

* Fix mypy false positive errors

* Simplify external config options instantiation

* Do not ignore api-core in type checks

More recent releases of google-api-core have typing enabled.

* Remove unneeded __hash__ = None lines

* Use an alias for timeout type in client.py

* Fix PathLike subscription error in pre-Python 3.9

* Fix a typo in docstring

Co-authored-by: Tim Swast <swast@google.com>

* Add mypy to the list of nox sessions to run

* Fix opentelemetry type error

The OpenTelemetry API has changed since the minimum version the
BigQuery client currently uses, so we need to bound the maximum
OpenTelemetry version.

In addition, that maximum version does not yet support type checks, so
it is ignored.

* 🦉 Updates from OwlBot

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* Exclude type-checking code from coverage

* Fix patching opentelemetry tracer provider

* Adjust get_job() return type, ignore opentelemetry

Co-authored-by: Tim Swast <swast@google.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. cla: yes This human has signed the Contributor License Agreement.
Development

Successfully merging this pull request may close these issues.

type check with mypy as well
4 participants