test: test other redaction libraries #418

Conversation

KrishPatel13
Collaborator

What kind of change does this PR introduce?
This PR contains starter code to check whether the other available redaction libraries are reliable enough to use in, or integrate with, OpenAdapt.

Summary

Checklist

  • My code follows the style guidelines of OpenAdapt
  • I have performed a self-review of my code
  • If applicable, I have added tests to prove my fix is functional/effective
  • I have linted my code locally prior to submission
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. README.md, requirements.txt)
  • New and existing unit tests pass locally with my changes

How can your code be run and tested?

Other information

@KrishPatel13 KrishPatel13 self-assigned this Jul 21, 2023
@KrishPatel13 KrishPatel13 marked this pull request as draft July 21, 2023 16:26
@KrishPatel13
Collaborator Author

KrishPatel13 commented Jul 21, 2023

Testing OpenRedact/anonymizer:

Got this error:

(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer> python .\test_openredact_anonymizer.py
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\lib\site-packages\pydantic\_internal\_config.py:261: UserWarning: Valid config keys have changed in V2:
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
  warnings.warn(message, UserWarning)
Traceback (most recent call last):
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\test_openredact_anonymizer.py", line 1, in <module>
    from anonymizer.anonymization.anonymizer import Anonymizer
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\__init__.py", line 1, in <module>
    from anonymizer.anonymization.config import AnonymizerConfig  # noqa: F401
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\anonymization\config.py", line 3, in <module>
    from ..mechanisms import mechanism_config_types, mechanism_types
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\mechanisms\__init__.py", line 1, in <module>
    from ._type_helpers import mechanism_config_types, mechanism_types, is_config  # noqa: F401
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\mechanisms\_type_helpers.py", line 3, in <module>
    from .generalization import GeneralizationParameters, Generalization
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\mechanisms\generalization.py", line 5, in <module>
    from .stateful_mechanism import StatefulMechanism
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\mechanisms\stateful_mechanism.py", line 9, in <module>
    class StatefulMechanism(CamelBaseModel, abc.ABC):
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\anonymizer\anonymizer\mechanisms\stateful_mechanism.py", line 22, in StatefulMechanism
    anonymizations: Dict[str, str] = Field(default_factory=dict, const=True)
  File "C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\lib\site-packages\pydantic\fields.py", line 723, in Field
    raise PydanticUserError('`const` is removed, use `Literal` instead', code='removed-kwargs')
pydantic.errors.PydanticUserError: `const` is removed, use `Literal` instead

For further information visit https://errors.pydantic.dev/2.0.3/u/removed-kwargs

On searching about the error `PydanticUserError: 'const' is removed, use 'Literal' instead`, I learned that it is likely caused by a dependency conflict with pydantic: the environment has pydantic v2 installed, while the library was written against the v1 API.
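For context, the removed `const=True` kwarg pinned a field to its default value; Pydantic v2 expects `typing.Literal` for that instead. A stdlib-only sketch of the same pinning idea (the class name is hypothetical and there is no pydantic dependency, so this only illustrates the concept, not the actual fix inside anonymizer):

```python
from dataclasses import dataclass, field
from typing import Dict, Literal


# Pydantic v1 allowed:
#     anonymizations: Dict[str, str] = Field(default_factory=dict, const=True)
# In v2 the `const` kwarg is removed; a value fixed to a single constant is
# typed with Literal instead. Stdlib illustration of that idea:
@dataclass
class StatefulMechanismSketch:
    kind: Literal["stateful"] = "stateful"  # may only ever hold this value
    anonymizations: Dict[str, str] = field(default_factory=dict)


m = StatefulMechanismSketch()
print(m.kind)  # stateful
```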

Ran (pip list):

Package            Version
------------------ --------
annotated-types    0.5.0
anonympy           0.3.7
anyio              3.7.1
asttokens          2.2.1
atomicwrites       1.4.1
attrs              23.1.0
beautifulsoup4     4.9.1
bidict             0.22.1
blis               0.7.9
catalogue          2.0.8
certifi            2023.5.7
cffi               1.15.1
cfgv               3.3.1
charset-normalizer 3.2.0
click              8.1.4
clr-loader         0.2.5
colorama           0.4.6
confection         0.1.0
coverage           7.2.7
cymem              2.0.7
decorator          5.1.1
defusedxml         0.6.0
distlib            0.3.7
en-core-web-sm     3.6.0
exceptiongroup     1.1.2
executing          1.2.0
expose-text        0.1.6
Faker              19.2.0
filelock           3.12.2
frozenlist         1.3.3
fsspec             2023.6.0
huggingface-hub    0.16.4
humanfriendly      10.0
identify           2.5.25
idna               3.4
Jinja2             3.1.2
joblib             1.3.1
jsonschema         3.2.0
langcodes          3.3.0
lazy-object-proxy  1.9.0
MarkupSafe         2.1.3
more-itertools     9.1.0
mpmath             1.3.0
multidict          6.0.4
murmurhash         1.0.9
networkx           3.1
nodeenv            1.8.0
numpy              1.25.1
opencv-python      4.8.0.74
packaging          23.1
pandas             2.0.3
parso              0.8.3
pathy              0.10.2
pdf2image          1.16.3
pdfkit             0.6.1
pdfrw              0.4
Pillow             10.0.0
pip                23.1.2
platformdirs       3.9.1
pluggy             0.13.1
poppler-utils      0.1.0
pre-commit         2.16.0
preshed            3.0.8
pure-eval          0.2.2
py                 1.11.0
pycparser          2.21
pycryptodome       3.18.0
pydantic           2.0.3
pydantic_core      2.3.0
PyPDF2             3.0.1
pyreadline3        3.4.1
pyrsistent         0.19.3
pytesseract        0.3.10
pytest             5.4.3
pytest-cov         2.10.0
python-dateutil    2.8.2
python-engineio    4.5.1
pytz               2023.3
PyYAML             6.0
regex              2023.6.3
requests           2.31.0
requests-file      1.5.1
rfc3339            6.2
safetensors        0.3.1
scikit-learn       1.3.0
scipy              1.11.1
setuptools         49.1.2
six                1.16.0
smart-open         6.3.0
sniffio            1.3.0
soupsieve          2.4.1
spacy              3.6.0
spacy-legacy       3.0.12
spacy-loggers      1.0.4
srsly              2.4.6
starlette          0.27.0
sympy              1.12
texttable          1.6.7
thinc              8.1.10
threadpoolctl      3.2.0
tokenizers         0.13.3
toml               0.10.2
tqdm               4.64.0
traitlets          5.9.0
transformers       4.31.0
typer              0.9.0
typing_extensions  4.7.1
tzdata             2023.3
urllib3            2.0.3
validators         0.20.0
virtualenv         20.24.1
wasabi             1.1.2
wcwidth            0.2.6
wheel              0.40.0
wkhtmltopdf        0.2
wrapt              1.15.0

On searching further I found a similar issue in the pydantic repository: pydantic/pydantic#561
On reading the thread, it looks like anonymizer (from OpenRedact) is not compatible with current pydantic versions as of now.

Also, OpenRedact/anonymizer carries a disclaimer that it is a prototype, and it has not been updated since Jan 2022 as of now.
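If we still wanted to trial the library, one possible workaround (not tested here) would be pinning pydantic to the v1 series in a scratch environment, since `Field(const=True)` was still valid there. The exact pin is an assumption and has not been checked against anonymizer's own requirements:

```
# scratch-requirements.txt -- for evaluating openredact/anonymizer only;
# do NOT apply to OpenAdapt itself without checking other pydantic consumers
pydantic<2
```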

@KrishPatel13
Collaborator Author

Testing Open Data Anonymizer:

(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction> pip install anonympy

Requirement already satisfied: anonympy in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (0.3.7)
Requirement already satisfied: faker in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (19.2.0)
Requirement already satisfied: scikit-learn in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (1.3.0)
Requirement already satisfied: opencv-python in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (4.8.0.74)
Requirement already satisfied: texttable in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (1.6.7)
Requirement already satisfied: setuptools in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (49.1.2)
Requirement already satisfied: numpy in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (1.25.1)
Requirement already satisfied: pandas in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (2.0.3)
Requirement already satisfied: validators in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (0.20.0)
Requirement already satisfied: pycryptodome in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (3.18.0)
Requirement already satisfied: requests in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (2.31.0)
Requirement already satisfied: pyyaml in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (6.0)
Requirement already satisfied: rfc3339 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (6.2)
Requirement already satisfied: pytesseract in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (0.3.10)
Requirement already satisfied: PyPDF2 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (3.0.1)
Requirement already satisfied: poppler-utils in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (0.1.0)
Requirement already satisfied: pdf2image in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (1.16.3)
Requirement already satisfied: transformers in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from anonympy) (4.31.0)
Requirement already satisfied: python-dateutil>=2.4 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from faker->anonympy) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from pandas->anonympy) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from pandas->anonympy) (2023.3)
Requirement already satisfied: pillow in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from pdf2image->anonympy) (10.0.0)
Requirement already satisfied: Click>=7.0 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from poppler-utils->anonympy) (8.1.4)
Requirement already satisfied: packaging>=21.3 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from pytesseract->anonympy) (23.1)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from requests->anonympy) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from requests->anonympy) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from requests->anonympy) (2.0.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from requests->anonympy) (2023.5.7)
Requirement already satisfied: scipy>=1.5.0 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from scikit-learn->anonympy) (1.11.1)
Requirement already satisfied: joblib>=1.1.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from scikit-learn->anonympy) (1.3.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from scikit-learn->anonympy) (3.2.0)
Requirement already satisfied: filelock in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from transformers->anonympy) (3.12.2)
Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from transformers->anonympy) (0.16.4)
Requirement already satisfied: regex!=2019.12.17 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from transformers->anonympy) (2023.6.3)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from transformers->anonympy) (0.13.3)
Requirement already satisfied: safetensors>=0.3.1 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from transformers->anonympy) (0.3.1)
Requirement already satisfied: tqdm>=4.27 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from transformers->anonympy) (4.64.0)
Requirement already satisfied: decorator>=3.4.0 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from validators->anonympy) (5.1.1)
Requirement already satisfied: colorama in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from Click>=7.0->poppler-utils->anonympy) (0.4.6)
Requirement already satisfied: fsspec in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from huggingface-hub<1.0,>=0.14.1->transformers->anonympy) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from huggingface-hub<1.0,>=0.14.1->transformers->anonympy) (4.7.1)
Requirement already satisfied: six>=1.5 in c:\users\krish patel\appdata\local\pypoetry\cache\virtualenvs\openadapt-dsrh12us-py3.10\lib\site-packages (from python-dateutil>=2.4->faker->anonympy) (1.16.0)

[notice] A new release of pip is available: 23.1.2 -> 23.2
[notice] To update, run: python.exe -m pip install --upgrade pip
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction> python .\test_open_data_anony_pdf.py
Traceback (most recent call last):
  File "P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction\test_open_data_anony_pdf.py", line 1, in <module>
    from anonympy.pdf import pdfAnonymizer
  File "C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\lib\site-packages\anonympy\__init__.py", line 1, in <module>
    from anonympy import pandas
  File "C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\lib\site-packages\anonympy\pandas\__init__.py", line 6, in <module>
    from anonympy.pandas.core_pandas import dfAnonymizer
  File "C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\lib\site-packages\anonympy\pandas\core_pandas.py", line 6, in <module>
    from cape_privacy.pandas import dtypes
ModuleNotFoundError: No module named 'cape_privacy'
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction> pip install cape-privacy
 ...
 /Tcnumpy\core\src\multiarray\scalarapi.c /Fobuild\temp.win-amd64-3.10\Release\numpy\core\src\multiarray\scalarapi.obj
        C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DNPY_INTERNAL_BUILD=1 -DHAVE_NPY_CONFIG_H=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Inumpy\core\include -Ibuild\src.win-amd64-3.1\numpy\core\include/numpy -Inumpy\core\src\private -Inumpy\core\src -Inumpy\core -Inumpy\core\src\npymath -Inumpy\core\src\multiarray -Inumpy\core\src\umath -Inumpy\core\src\npysort -I"C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\include" -I"C:\Program Files\Python310\include" -I"C:\Program Files\Python310\Include" -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Ibuild\src.win-amd64-3.1\numpy\core\src\npymath -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Ibuild\src.win-amd64-3.1\numpy\core\src\npymath -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Ibuild\src.win-amd64-3.1\numpy\core\src\npymath -I"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\include" -I"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\ATLMFC\include" -I"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" -I"C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" -I"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" -I"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" -I"C:\Program Files (x86)\Windows 
Kits\10\include\10.0.22621.0\shared" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\um" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\winrt" /Tcbuild\src.win-amd64-3.1\numpy\core\src\multiarray\scalartypes.c /Fobuild\temp.win-amd64-3.10\Release\build\src.win-amd64-3.1\numpy\core\src\multiarray\scalartypes.obj
        error: Command "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DNPY_INTERNAL_BUILD=1 -DHAVE_NPY_CONFIG_H=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Inumpy\core\include -Ibuild\src.win-amd64-3.1\numpy\core\include/numpy -Inumpy\core\src\private -Inumpy\core\src -Inumpy\core -Inumpy\core\src\npymath -Inumpy\core\src\multiarray -Inumpy\core\src\umath -Inumpy\core\src\npysort -I"C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-DSRh12US-py3.10\include" -I"C:\Program Files\Python310\include" -I"C:\Program Files\Python310\Include" -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Ibuild\src.win-amd64-3.1\numpy\core\src\npymath -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Ibuild\src.win-amd64-3.1\numpy\core\src\npymath -Ibuild\src.win-amd64-3.1\numpy\core\src\private -Ibuild\src.win-amd64-3.1\numpy\core\src\npymath -I"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\include" -I"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.36.32532\ATLMFC\include" -I"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" -I"C:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" -I"C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" -I"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" -I"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" -I"C:\Program Files 
(x86)\Windows Kits\10\include\10.0.22621.0\shared" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\um" -I"C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\winrt" /Tcbuild\src.win-amd64-3.1\numpy\core\src\multiarray\scalartypes.c /Fobuild\temp.win-amd64-3.10\Release\build\src.win-amd64-3.1\numpy\core\src\multiarray\scalartypes.obj" failed with exit status 2
        scalartypes.c
        numpy\core\include\numpy/npy_3kcompat.h(198): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\common.h(269): warning C4244: 'return': conversion from 'npy_intp' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(483): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(483): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(483): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(482): warning C4996: 'PyUnicode_AsUnicode': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(483): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(488): warning C4996: 'PyUnicode_FromUnicode': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(483): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(482): warning C4996: 'PyUnicode_AsUnicode': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(483): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(488): warning C4996: 'PyUnicode_FromUnicode': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(516): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(517): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(1912): warning C4244: 'function': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(1912): warning C4244: 'function': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(1866): warning C4996: 'PyUnicode_AsUnicode': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(1867): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
        numpy\core\src\multiarray\scalartypes.c.src(1871): warning C4996: 'PyObject_AsReadBuffer': deprecated in 3.0
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2788): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2768): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(2788): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
        numpy\core\src\multiarray\scalartypes.c.src(3228): error C2440: 'function': cannot convert from 'double' to 'PyObject *'   
        numpy\core\src\multiarray\scalartypes.c.src(3228): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3228): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(3237): error C2440: 'function': cannot convert from 'double' to 'PyObject *'   
        numpy\core\src\multiarray\scalartypes.c.src(3237): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3236): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(3243): error C2440: 'function': cannot convert from 'double' to 'PyObject *'   
        numpy\core\src\multiarray\scalartypes.c.src(3243): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3242): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(3228): error C2440: 'function': cannot convert from 'npy_longdouble' to 'PyObject *'
        numpy\core\src\multiarray\scalartypes.c.src(3228): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3228): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(3237): error C2440: 'function': cannot convert from 'npy_longdouble' to 'PyObject *'
        numpy\core\src\multiarray\scalartypes.c.src(3237): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3236): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(3243): error C2440: 'function': cannot convert from 'npy_longdouble' to 'PyObject *'
        numpy\core\src\multiarray\scalartypes.c.src(3243): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3242): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(3258): error C2440: 'function': cannot convert from 'double' to 'PyObject *'   
        numpy\core\src\multiarray\scalartypes.c.src(3258): warning C4024: '_Py_HashDouble': different types for formal and actual parameter 1
        numpy\core\src\multiarray\scalartypes.c.src(3258): error C2198: '_Py_HashDouble': too few arguments for call
        numpy\core\src\multiarray\scalartypes.c.src(4478): warning C4244: 'return': conversion from 'npy_intp' to 'int', possible loss of data
        [end of output]

        note: This error originates from a subprocess, and is likely not a problem with pip.
        ERROR: Failed building wheel for numpy
        Running setup.py clean for numpy
        error: subprocess-exited-with-error

        python setup.py clean did not run successfully.
        exit code: 1

        [10 lines of output]
        Running from numpy source directory.

        `setup.py clean` is not supported, use one of the following instead:

          - `git clean -xdf` (cleans all files)
          - `git clean -Xdf` (cleans all versioned files, doesn't touch
                              files that aren't checked into the git repo)

        Add `--force` to your command to use it anyway if you must (unsupported).

        [end of output]

        note: This error originates from a subprocess, and is likely not a problem with pip.
        ERROR: Failed cleaning build dir for numpy
      Failed to build numpy
      ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects

      [notice] A new release of pip is available: 23.1.2 -> 23.2
      [notice] To update, run: python.exe -m pip install --upgrade pip
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

[notice] A new release of pip is available: 23.1.2 -> 23.2
[notice] To update, run: python.exe -m pip install --upgrade pip

@abrichr
Member

abrichr commented Jul 24, 2023

Please submit an issue to https://github.com/openredact/anonymizer/issues and https://github.com/ArtLabss/open-data-anonymizer, and link back here

@KrishPatel13
Collaborator Author

KrishPatel13 commented Jul 24, 2023

Please submit an issue to https://github.com/openredact/anonymizer/issues and https://github.com/ArtLabss/open-data-anonymizer, and link back here

Open Data Anonymizer: ArtLabss/open-data-anonymizer#28

Open Redact: openredact/anonymizer#11

@KrishPatel13
Collaborator Author

KrishPatel13 commented Aug 11, 2023

Todo:

  • Amazon Comprehend
  • Private AI

@KrishPatel13
Collaborator Author

Finally got AWS Comprehend to work:

Ran a sample test script:

(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction> python .\comprehend_detect.py
----------------------------------------------------------------------------------------
Welcome to the Amazon Comprehend detection demo!
----------------------------------------------------------------------------------------
INFO: Found credentials in shared credentials file: ~/.aws/credentials
Sample text used for this demo:
----------------------------------------------------------------------------------------
Hello Zhang Wei. Your AnyCompany Financial Services, LLC credit card account
1111-0000-1111-0000 has a minimum payment of $24.53 that is due by July 31st.
Based on your autopay settings, we will withdraw your payment on the due date from
your bank account XXXXXX1111 with the routing number XXXXX0000.

Your latest statement was mailed to 100 Main Street, Anytown, WA 98121.
After your payment is received, you will receive a confirmation text message
at 206-555-0100.

If you have questions about your bill, AnyCompany Customer Service is available by
phone at 206-555-0199 or email at support@anycompany.com.

----------------------------------------------------------------------------------------
Detecting languages.
INFO: Detected 1 languages.
[{'LanguageCode': 'en', 'Score': 0.9954520463943481}]
Detecting entities.
INFO: Detected 12 entities.
The first 3 are:
[{'BeginOffset': 6,
  'EndOffset': 15,
  'Score': 0.9991974830627441,
  'Text': 'Zhang Wei',
  'Type': 'PERSON'},
 {'BeginOffset': 22,
  'EndOffset': 56,
  'Score': 0.9989182949066162,
  'Text': 'AnyCompany Financial Services, LLC',      
  'Type': 'ORGANIZATION'},
 {'BeginOffset': 77,
  'EndOffset': 96,
  'Score': 0.9873828291893005,
  'Text': '1111-0000-1111-0000',
  'Type': 'OTHER'}]
Detecting key phrases.
INFO: Detected 23 phrases.
The first 3 are:
[{'BeginOffset': 0,
  'EndOffset': 15,
  'Score': 0.8426342606544495,
  'Text': 'Hello Zhang Wei'},
 {'BeginOffset': 17,
  'EndOffset': 51,
  'Score': 0.9881375432014465,
  'Text': 'Your AnyCompany Financial Services'},     
 {'BeginOffset': 53,
  'EndOffset': 96,
  'Score': 0.8444651961326599,
  'Text': 'LLC credit card account\n1111-0000-1111-0000'}]
Detecting personally identifiable information (PII). 
INFO: Detected 9 PII entities.
The first 3 are:
[{'BeginOffset': 6,
  'EndOffset': 15,
  'Score': 0.9997988939285278,
  'Type': 'NAME'},
 {'BeginOffset': 77,
  'EndOffset': 96,
  'Score': 0.999958872795105,
  'Type': 'CREDIT_DEBIT_NUMBER'},
 {'BeginOffset': 144,
  'EndOffset': 153,
  'Score': 0.9999958872795105,
  'Type': 'DATE_TIME'}]
Detecting sentiment.
INFO: Detected primary sentiment NEUTRAL.
Sentiment: NEUTRAL
SentimentScore:
{'Mixed': 9.69591928878799e-06,
INFO: Detected 107 syntax tokens.
The first 3 are:
[{'BeginOffset': 0,
  'EndOffset': 5,
  'PartOfSpeech': {'Score': 0.9888805150985718, 'Tag': 'INTJ'},
  'Text': 'Hello',
  'TokenId': 1},
 {'BeginOffset': 6,
  'EndOffset': 11,
  'PartOfSpeech': {'Score': 0.9991546273231506, 'Tag': 'PROPN'},
  'Text': 'Zhang',
  'TokenId': 2},
 {'BeginOffset': 12,
  'EndOffset': 15,
  'PartOfSpeech': {'Score': 0.9982988238334656, 'Tag': 'PROPN'},
  'Text': 'Wei',
  'TokenId': 3}]
Thanks for watching!
----------------------------------------------------------------------------------------
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\test_other\OpenAdapt\openadapt\research_redaction>
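The offsets that `detect_pii_entities` returns can drive redaction directly. Below is a minimal sketch (not OpenAdapt code; the `redact_pii` helper name is illustrative, and the entity offsets are copied from the first three PII entities in the log above) that masks each detected span with its entity type:

```python
def redact_pii(text: str, entities: list[dict]) -> str:
    """Replace each detected PII span with its entity type.

    Spans are applied back-to-front so that earlier offsets
    remain valid after each substitution.
    """
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (
            text[: ent["BeginOffset"]]
            + f"[{ent['Type']}]"
            + text[ent["EndOffset"] :]
        )
    return text


# The first three PII entities reported by Comprehend above.
entities = [
    {"BeginOffset": 6, "EndOffset": 15, "Type": "NAME"},
    {"BeginOffset": 77, "EndOffset": 96, "Type": "CREDIT_DEBIT_NUMBER"},
    {"BeginOffset": 144, "EndOffset": 153, "Type": "DATE_TIME"},
]
text = (
    "Hello Zhang Wei. Your AnyCompany Financial Services, LLC credit card account\n"
    "1111-0000-1111-0000 has a minimum payment of $24.53 that is due by July 31st."
)
print(redact_pii(text, entities))
```

Running this prints the sample text with `[NAME]`, `[CREDIT_DEBIT_NUMBER]`, and `[DATE_TIME]` in place of the detected spans.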

@KrishPatel13
Collaborator Author

Looks good on the LLC entity:

{'BeginOffset': 22,
'EndOffset': 56,
'Score': 0.9989182949066162,
'Text': 'AnyCompany Financial Services, LLC',
'Type': 'ORGANIZATION'},

@KrishPatel13
Collaborator Author

KrishPatel13 commented Aug 21, 2023

Private AI update: I have requested a free API key and am currently waiting to receive it.

image

@KrishPatel13
Collaborator Author

KrishPatel13 commented Aug 21, 2023

Next Step:

  • Add AWS as a scrubbing provider in privacy

#476 - Ready for review @abrichr
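To illustrate what "scrubbing provider" means here, this is a minimal, hypothetical sketch of a pluggable provider interface; the class and method names are illustrative, not the actual OpenAdapt privacy API. Backends such as Presidio, AWS Comprehend, or Private AI could each implement the same interface:

```python
from abc import ABC, abstractmethod


class ScrubbingProvider(ABC):
    """Hypothetical common interface for redaction backends
    (Presidio, AWS Comprehend, Private AI, ...)."""

    @abstractmethod
    def scrub_text(self, text: str) -> str:
        """Return the text with sensitive content redacted."""


class MaskingProvider(ScrubbingProvider):
    """Toy provider that masks a fixed list of terms."""

    def __init__(self, terms: list[str]) -> None:
        self.terms = terms

    def scrub_text(self, text: str) -> str:
        for term in self.terms:
            text = text.replace(term, "<REDACTED>")
        return text


print(MaskingProvider(["Zhang Wei"]).scrub_text("Hello Zhang Wei."))
# → Hello <REDACTED>.
```

Swapping providers then only requires constructing a different `ScrubbingProvider` implementation; callers stay unchanged.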

@KrishPatel13
Collaborator Author

KrishPatel13 commented Aug 24, 2023

Private AI Testing Results:

Images:

====================================================================
Before:
image

After:
sample_emr_1_redacted_file

====================================================================

PDFs:

====================================================================
Before:
sample_llc_1.pdf

After:
sample_llc_1_redacted_file.pdf

====================================================================

Text:

====================================================================
It works exactly the same as CapePrivacy.
It supports over 50 entity types: https://docs.private-ai.com/entities/
It performs better than Presidio and Amazon's AWS Comprehend.

Cons:
It is commercial. The free API key covers only 500 words; beyond that it is charged. Pricing: https://www.private-ai.com/pricing/#redact

====================================================================

@KrishPatel13
Collaborator Author

@abrichr We should close this when #486 is merged. 🙏

@abrichr abrichr closed this Sep 1, 2023