Error processing cyrillic strings in Tokenizer #1462

Closed
sspat opened this issue May 10, 2017 · 24 comments

@sspat

sspat commented May 10, 2017

What versions are affected?
This bug appeared when I switched from 2.8.1 to 3.0.0.
PHP version 7.1.4.

What causes the bug?
It happens when the analyzed source code contains Cyrillic strings.

Example code causing the bug
<?php $arr = [ 'ы' => 1 ];
<?php const FOO = 'ы';

The error message is:
iconv_strlen(): Detected an illegal character in input string in /vendor/squizlabs/php_codesniffer/src/Tokenizers/Tokenizer.php on line 193

@sspat sspat changed the title Error processing cyrillic array keys in Tokenizer Error processing cyrillic strings in Tokenizer May 10, 2017
@sspat sspat closed this as completed May 10, 2017
@sspat sspat reopened this May 10, 2017
@gsherwood
Member

I don't get any errors running PHPCS 3 over your sample code.

The line reporting the error is actually muting any error output from iconv_strlen(), but I removed that error suppression when debugging and still didn't get any errors.

I'm wondering if you have set an encoding that is incompatible with the content of your files, possibly while using 2.x, where the default was not utf-8. Version 3 uses utf-8 by default.

Does any of that sound possible?

It might also be worth running phpcs over the sample code and using the -vv command line option. That output will show how the file is tokenized and how your coding standard is being loaded.
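
On the error suppression mentioned above: `@` only hides a notice from PHP's default error output. As a general PHP behaviour, a callback registered with `set_error_handler()` is still invoked for it. Whether that plays any part in this report is not established here. The sketch below only demonstrates the language behaviour, using `trigger_error()` as a stand-in for the `iconv_strlen()` notice; none of it is PHP_CodeSniffer code.

```php
<?php
// Generic PHP demonstration; nothing here is PHP_CodeSniffer code.
$seen = [];

// A custom handler is still invoked for errors raised under the @ operator.
set_error_handler(function (int $severity, string $message) use (&$seen): bool {
    $seen[] = $message;
    return true; // mark the error as handled
});

// Stand-in for a suppressed notice such as the one from iconv_strlen().
@trigger_error('suppressed notice', E_USER_NOTICE);

restore_error_handler();

echo count($seen), PHP_EOL; // number of errors the handler received
```

If the handler threw an exception instead of recording the message, the suppressed notice would surface as an exception despite the `@`.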

@sspat
Author

sspat commented May 11, 2017

I am running phpcs as an inspection in PHPStorm. The exact error message PHPStorm is giving me is:

phpcs: Internal.Exception: An error occurred during processing; checking has been aborted. The error message was: iconv_strlen(): Detected an illegal character in input string in /vagrant/vendor/squizlabs/php_codesniffer/src/Tokenizers/Tokenizer.php on line 19

The file analyzed is encoded in UTF-8.

file -i test.php
test.php: text/x-php; charset=utf-8

As far as I can tell, PHPStorm doesn't change the default encoding in the phpcs settings; I could not find any means to pass configuration to phpcs when running it from PHPStorm.

These are the exact contents of test.php:

<?php
$foo = 'ы';

This is the output of the vendor/squizlabs/php_codesniffer/bin/phpcs -vv test.php command when run on the command line manually with phpcs 3.0.0:

Processing ruleset /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/ruleset.xml
        Adding sniff files from /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs directory
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Classes/ClassDeclarationSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Commenting/ClassCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Commenting/FileCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Commenting/FunctionCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Commenting/InlineCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/ControlStructures/ControlSignatureSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/ControlStructures/MultiLineConditionSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Files/IncludingFileSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Formatting/MultiLineAssignmentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Functions/FunctionCallSignatureSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Functions/FunctionDeclarationSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/Functions/ValidDefaultValueSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/NamingConventions/ValidClassNameSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/NamingConventions/ValidFunctionNameSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/NamingConventions/ValidVariableNameSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/WhiteSpace/ObjectOperatorIndentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/WhiteSpace/ScopeClosingBraceSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/PEAR/Sniffs/WhiteSpace/ScopeIndentSniff.php
        Processing rule "Generic.Functions.FunctionCallArgumentSpacing"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/Functions/FunctionCallArgumentSpacingSniff.php
        Processing rule "Generic.NamingConventions.UpperCaseConstantName"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/NamingConventions/UpperCaseConstantNameSniff.php
        Processing rule "Generic.PHP.LowerCaseConstant"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/PHP/LowerCaseConstantSniff.php
        Processing rule "Generic.PHP.DisallowShortOpenTag"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/PHP/DisallowShortOpenTagSniff.php
        Processing rule "Generic.WhiteSpace.DisallowTabIndent"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/WhiteSpace/DisallowTabIndentSniff.php
        Processing rule "Generic.Commenting.DocComment"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/Commenting/DocCommentSniff.php
        Processing rule "Generic.Files.LineLength"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/Files/LineLengthSniff.php
                => property "lineLimit" set to "85"
                => property "absoluteLineLimit" set to "0"
        Processing rule "Generic.Files.LineEndings"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/Files/LineEndingsSniff.php
                => property "eolChar" set to "\n"
        Processing rule "Generic.Functions.FunctionCallArgumentSpacing.TooMuchSpaceAfterComma"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/Functions/FunctionCallArgumentSpacingSniff.php
                => severity set to 0
        Processing rule "Generic.ControlStructures.InlineControlStructure"
                => /vagrant/vendor/squizlabs/php_codesniffer/src/Standards/Generic/Sniffs/ControlStructures/InlineControlStructureSniff.php
                => property "error" set to "false"
=> Ruleset processing complete; included 27 sniffs and excluded 0
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  do
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_CURLY_BRACKET => {
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token [2]: T_WHILE => while
        Process token [3]: T_WHITESPACE => В·
        Process token  4 : T_OPEN_PARENTHESIS => (
        Process token [5]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  ;
        Process token [2]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  while
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  for
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  if
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  foreach
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  }
        Process token [2]: T_WHITESPACE => В·
        Process token [3]: T_ELSE => else
        Process token [4]: T_WHITESPACE => В·
        Process token [5]: T_IF => if
        Process token [6]: T_WHITESPACE => В·
        Process token  7 : T_OPEN_PARENTHESIS => (
        Process token [8]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  }
        Process token [2]: T_WHITESPACE => В·
        Process token [3]: T_ELSEIF => elseif
        Process token [4]: T_WHITESPACE => В·
        Process token  5 : T_OPEN_PARENTHESIS => (
        Process token [6]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  }
        Process token [2]: T_WHITESPACE => В·
        Process token [3]: T_ELSE => else
        Process token [4]: T_WHITESPACE => В·
        Process token  5 : T_OPEN_CURLY_BRACKET => {
        Process token [6]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  do
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_CURLY_BRACKET => {
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
Creating file list... DONE (1 files in queue)
Changing into directory /vagrant
Processing test.php 
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  $foo
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_EQUAL => =
        Process token [4]: T_WHITESPACE => В·
        Process token [5]: T_CONSTANT_ENCAPSED_STRING => 'С‹'
        Process token  6 : T_SEMICOLON => ;
        Process token [7]: T_WHITESPACE => \r\n
        *** END PHP TOKENIZING ***
        *** START TOKEN MAP ***
        *** END TOKEN MAP ***
        *** START SCOPE MAP ***
        *** END SCOPE MAP ***
        *** START LEVEL MAP ***
        Process token 0 on line 1 [col:1;len:5;lvl:0;]: T_OPEN_TAG =>  $foo
        Process token 2 on line 2 [col:5;len:1;lvl:0;]: T_WHITESPACE => В·
        Process token 3 on line 2 [col:6;len:1;lvl:0;]: T_EQUAL => =
        Process token 4 on line 2 [col:7;len:1;lvl:0;]: T_WHITESPACE => В·
        Process token 5 on line 2 [col:8;len:3;lvl:0;]: T_CONSTANT_ENCAPSED_STRING => 'С‹'
        Process token 6 on line 2 [col:11;len:1;lvl:0;]: T_SEMICOLON => ;
        Process token 7 on line 2 [col:12;len:0;lvl:0;]: T_WHITESPACE => \r\n
        *** END LEVEL MAP ***
        *** START ADDITIONAL PHP PROCESSING ***
        *** END ADDITIONAL PHP PROCESSING ***
[PHP => 8 tokens in 2 lines]... 
DONE in 3ms (2 errors, 0 warnings)

FILE: /vagrant/test.php
----------------------------------------------------------------------------------
FOUND 2 ERRORS AFFECTING 2 LINES
----------------------------------------------------------------------------------
 1 | ERROR | [x] End of line character is invalid; expected "\n" but found "\r\n"
 2 | ERROR | [ ] Missing file doc comment
----------------------------------------------------------------------------------
PHPCBF CAN FIX THE 1 MARKED SNIFF VIOLATIONS AUTOMATICALLY
----------------------------------------------------------------------------------

Time: 213ms; Memory: 4Mb

This is the output of the same command on the same file, but with the Cyrillic 'ы' replaced by the Latin 'a'. The sniffs are all the same, so I omitted that part.

Processing test.php 
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  $foo
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_EQUAL => =
        Process token [4]: T_WHITESPACE => В·
        Process token [5]: T_CONSTANT_ENCAPSED_STRING => 'a'
        Process token  6 : T_SEMICOLON => ;
        Process token [7]: T_WHITESPACE => \r\n
        *** END PHP TOKENIZING ***
        *** START TOKEN MAP ***
        *** END TOKEN MAP ***
        *** START SCOPE MAP ***
        *** END SCOPE MAP ***
        *** START LEVEL MAP ***
        Process token 0 on line 1 [col:1;len:5;lvl:0;]: T_OPEN_TAG =>  $foo
        Process token 2 on line 2 [col:5;len:1;lvl:0;]: T_WHITESPACE => В·
        Process token 3 on line 2 [col:6;len:1;lvl:0;]: T_EQUAL => =
        Process token 4 on line 2 [col:7;len:1;lvl:0;]: T_WHITESPACE => В·
        Process token 5 on line 2 [col:8;len:3;lvl:0;]: T_CONSTANT_ENCAPSED_STRING => 'a'
        Process token 6 on line 2 [col:11;len:1;lvl:0;]: T_SEMICOLON => ;
        Process token 7 on line 2 [col:12;len:0;lvl:0;]: T_WHITESPACE => \r\n
        *** END LEVEL MAP ***
        *** START ADDITIONAL PHP PROCESSING ***
        *** END ADDITIONAL PHP PROCESSING ***
[PHP => 8 tokens in 2 lines]... 
DONE in 4ms (2 errors, 0 warnings)

FILE: /vagrant/test.php
----------------------------------------------------------------------------------
FOUND 2 ERRORS AFFECTING 2 LINES
----------------------------------------------------------------------------------
 1 | ERROR | [x] End of line character is invalid; expected "\n" but found "\r\n"
 2 | ERROR | [ ] Missing file doc comment
----------------------------------------------------------------------------------
PHPCBF CAN FIX THE 1 MARKED SNIFF VIOLATIONS AUTOMATICALLY
----------------------------------------------------------------------------------

Time: 220ms; Memory: 4Mb

This is the output with phpcs 2.8.1. There are no errors in PHPStorm with this version:

Processing ruleset /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/ruleset.xml
        Adding sniff files from "/.../PEAR/Sniffs/" directory
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Classes/ClassDeclarationSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Commenting/ClassCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Commenting/FileCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Commenting/FunctionCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Commenting/InlineCommentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/ControlStructures/ControlSignatureSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/ControlStructures/MultiLineConditionSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Files/IncludingFileSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Formatting/MultiLineAssignmentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Functions/FunctionCallSignatureSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Functions/FunctionDeclarationSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/Functions/ValidDefaultValueSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/NamingConventions/ValidClassNameSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/NamingConventions/ValidFunctionNameSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/NamingConventions/ValidVariableNameSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/WhiteSpace/ObjectOperatorIndentSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/WhiteSpace/ScopeClosingBraceSniff.php
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/PEAR/Sniffs/WhiteSpace/ScopeIndentSniff.php
        Processing rule "Generic.Functions.FunctionCallArgumentSpacing"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/Functions/FunctionCallArgumentSpacingSniff.php
        Processing rule "Generic.NamingConventions.UpperCaseConstantName"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/NamingConventions/UpperCaseConstantNameSniff.php
        Processing rule "Generic.PHP.LowerCaseConstant"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/PHP/LowerCaseConstantSniff.php
        Processing rule "Generic.PHP.DisallowShortOpenTag"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/PHP/DisallowShortOpenTagSniff.php
        Processing rule "Generic.WhiteSpace.DisallowTabIndent"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/WhiteSpace/DisallowTabIndentSniff.php
        Processing rule "Generic.Commenting.DocComment"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/Commenting/DocCommentSniff.php
        Processing rule "Generic.Files.LineLength"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/Files/LineLengthSniff.php
                => property "lineLimit" set to "85"
                => property "absoluteLineLimit" set to "0"
        Processing rule "Generic.Files.LineEndings"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/Files/LineEndingsSniff.php
                => property "eolChar" set to "\n"
        Processing rule "Generic.Functions.FunctionCallArgumentSpacing.TooMuchSpaceAfterComma"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/Functions/FunctionCallArgumentSpacingSniff.php
                => severity set to 0
        Processing rule "Generic.ControlStructures.InlineControlStructure"
                => /vagrant/vendor/squizlabs/php_codesniffer/CodeSniffer/Standards/Generic/Sniffs/ControlStructures/InlineControlStructureSniff.php
                => property "error" set to "false"
=> Ruleset processing complete; included 27 sniffs and excluded 0
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  do
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_CURLY_BRACKET => {
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token [2]: T_WHILE => while
        Process token [3]: T_WHITESPACE => В·
        Process token  4 : T_OPEN_PARENTHESIS => (
        Process token [5]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  ;
        Process token [2]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  while
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  for
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  if
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  foreach
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_PARENTHESIS => (
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  }
        Process token [2]: T_WHITESPACE => В·
        Process token [3]: T_ELSE => else
        Process token [4]: T_WHITESPACE => В·
        Process token [5]: T_IF => if
        Process token [6]: T_WHITESPACE => В·
        Process token  7 : T_OPEN_PARENTHESIS => (
        Process token [8]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  }
        Process token [2]: T_WHITESPACE => В·
        Process token [3]: T_ELSEIF => elseif
        Process token [4]: T_WHITESPACE => В·
        Process token  5 : T_OPEN_PARENTHESIS => (
        Process token [6]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  В·
        Process token  2 : T_OPEN_CURLY_BRACKET => {
        Process token [3]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  }
        Process token [2]: T_WHITESPACE => В·
        Process token [3]: T_ELSE => else
        Process token [4]: T_WHITESPACE => В·
        Process token  5 : T_OPEN_CURLY_BRACKET => {
        Process token [6]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  do
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_OPEN_CURLY_BRACKET => {
        Process token [4]: T_CLOSE_TAG => ?>
        *** END PHP TOKENIZING ***
Creating file list... DONE (1 files in queue)
Changing into directory /vagrant
Processing test.php 
        *** START PHP TOKENIZING ***
        Process token [0]: T_OPEN_TAG =>  $foo
        Process token [2]: T_WHITESPACE => В·
        Process token  3 : T_EQUAL => =
        Process token [4]: T_WHITESPACE => В·
        Process token [5]: T_CONSTANT_ENCAPSED_STRING => 'С‹'
        Process token  6 : T_SEMICOLON => ;
        Process token [7]: T_WHITESPACE => \r\n
        *** END PHP TOKENIZING ***
        *** START TOKEN MAP ***
        *** END TOKEN MAP ***
        *** START SCOPE MAP ***
        *** END SCOPE MAP ***
        *** START LEVEL MAP ***
        Process token 0 on line 1 [col:1;len:5;lvl:0;]: T_OPEN_TAG =>  $foo
        Process token 2 on line 2 [col:5;len:1;lvl:0;]: T_WHITESPACE => В·
        Process token 3 on line 2 [col:6;len:1;lvl:0;]: T_EQUAL => =
        Process token 4 on line 2 [col:7;len:1;lvl:0;]: T_WHITESPACE => В·
        Process token 5 on line 2 [col:8;len:4;lvl:0;]: T_CONSTANT_ENCAPSED_STRING => 'С‹'
        Process token 6 on line 2 [col:12;len:1;lvl:0;]: T_SEMICOLON => ;
        Process token 7 on line 2 [col:13;len:0;lvl:0;]: T_WHITESPACE => \r\n
        *** END LEVEL MAP ***
        *** START ADDITIONAL PHP PROCESSING ***
        *** END ADDITIONAL PHP PROCESSING ***
[PHP => 8 tokens in 2 lines]... 
DONE in 10ms (2 errors, 0 warnings)

FILE: /vagrant/test.php
----------------------------------------------------------------------
FOUND 2 ERRORS AFFECTING 2 LINES
----------------------------------------------------------------------
 1 | ERROR | [x] End of line character is invalid; expected "\n" but
   |       |     found "\r\n"
 2 | ERROR | [ ] Missing file doc comment
----------------------------------------------------------------------
PHPCBF CAN FIX THE 1 MARKED SNIFF VIOLATIONS AUTOMATICALLY
----------------------------------------------------------------------

Time: 162ms; Memory: 4Mb
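
One difference between the runs above is the length reported for the string token: `len:3` under 3.0.0 and `len:4` under 2.8.1. That is consistent with counting characters in one case and bytes in the other. A minimal sketch of the two counts, assuming the mbstring extension is available (the variable names are illustrative):

```php
<?php
// Illustrative only: byte length versus character length of the quoted token 'ы'.
$token = "'\u{044B}'"; // U+044B is Cyrillic 'ы'; the escape yields its UTF-8 bytes

$byteLength = strlen($token);             // counts bytes
$charLength = mb_strlen($token, 'UTF-8'); // counts code points

echo "bytes: {$byteLength}, characters: {$charLength}", PHP_EOL;
// prints "bytes: 4, characters: 3"
```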

@gsherwood
Member

So you don't see any PHP errors when using PHPCS on the command line? But you also don't see the content correctly in the output? (I do see the content correctly in my debug output)

Can you try running vendor/squizlabs/php_codesniffer/bin/phpcs test.php -vv --encoding=utf-8 and see if it makes any difference? I can only replicate your error if I pass an invalid encoding, such as utf-16.

It would also be good if you could paste the output of vendor/squizlabs/php_codesniffer/bin/phpcs --config-show. It might be empty, but it's worth checking just in case.

@sspat
Author

sspat commented May 11, 2017

Yes, there are no errors when running from the command line.
If by not seeing correct output you mean that I don't see Cyrillic symbols in the tokenizer output, that is correct: they show up broken. Please note that version 2.8.1 also has problems outputting Cyrillic to the console on my system, but it does not cause any errors in PHPStorm.

I tried running with the --encoding=utf-8 flag - the output was the same as without it.

When running with the --config-show flag I get empty output on both versions - 2.8.1 and 3.0.0.

@gsherwood
Member

I'm not really sure what is going on then. I didn't make any big changes to that code in 3.0 except for changing the default encoding to utf-8.

If you'd like to do some debugging, the easiest thing to do is drop in an echo before line 193 in Tokenizer.php. The code in that area looks like this:

// There are no tabs in this content, or we aren't replacing them.
if ($checkEncoding === true) {
    // Not using the default encoding, so take a bit more care.
    $length = @iconv_strlen($this->tokens[$i]['content'], $this->config->encoding);
    if ($length === false) {
        // String contained invalid characters, so revert to default.
        $length = strlen($this->tokens[$i]['content']);
    }
} else {
    $length = strlen($this->tokens[$i]['content']);
}

Line 193 is that call to iconv_strlen(). Change that code to look like this:

// There are no tabs in this content, or we aren't replacing them.
if ($checkEncoding === true) {
    // Not using the default encoding, so take a bit more care.
    echo 'Trying to get length for "'.$this->tokens[$i]['content'].'" using encoding "'.$this->config->encoding.'"'.PHP_EOL;
    $length = @iconv_strlen($this->tokens[$i]['content'], $this->config->encoding);
    if ($length === false) {
        // String contained invalid characters, so revert to default.
        $length = strlen($this->tokens[$i]['content']);
    }
} else {
    $length = strlen($this->tokens[$i]['content']);
}

Running PHPCS over a file should then give you output like this:

Trying to get length for "<?php
" using encoding "utf-8"
Trying to get length for "$arr" using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for "'ы'" using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for "1" using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for "
" using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for "FOO" using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for " " using encoding "utf-8"
Trying to get length for "'ы'" using encoding "utf-8"
Trying to get length for "
" using encoding "utf-8"

...

If nothing else, it would hopefully show us where the error is, although you might need to run it via PHPStorm and hope it shows all output.
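
The fallback in the snippet above can also be tried outside PHPCS. This is a minimal standalone sketch, assuming the iconv extension is loaded; the function name `tokenLength` is hypothetical and not part of PHP_CodeSniffer.

```php
<?php
/**
 * Hypothetical helper mirroring the fallback quoted above; not PHPCS code.
 */
function tokenLength(string $content, string $encoding): int
{
    // @ hides the notice that iconv_strlen() raises on invalid input.
    $length = @iconv_strlen($content, $encoding);
    if ($length === false) {
        // Invalid byte sequence for this encoding: fall back to the byte count.
        $length = strlen($content);
    }

    return $length;
}

echo tokenLength("'\u{044B}'", 'utf-8'), PHP_EOL; // valid UTF-8: quoted Cyrillic 'ы'
echo tokenLength("// \xE9\n", 'utf-8'), PHP_EOL;  // lone 0xE9 byte is not valid UTF-8
```

Running it with different encoding names and inputs shows which combinations take the fallback branch.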

@mourawaldson

I am having the same issue on PHPStorm on 3.0.1.

Since I didn't have much time to investigate, I'll just give a heads-up on what I found.

This is happening not only for Cyrillic characters but for many others, and the issue seems to be in the way iconv_strlen() works.

Run the following sample code and you'll see the issue happening regardless:

<?php
$str = "Iñtërnâtiôn\xe9àlizætiøn";

print "mb_strlen: ".mb_strlen($str,'UTF-8')."\n";
print "strlen/utf8_decode: ".strlen(utf8_decode($str))."\n";
print "iconv_strlen: ".iconv_strlen($str,'UTF-8')."\n";
?>

Source: http://php.net/manual/en/function.iconv-strlen.php#62320

Result:

mb_strlen: 21
strlen/utf8_decode: 21
<br />
<b>Notice</b>:  iconv_strlen(): Detected an illegal character in input string in <b>[...][...]</b> on line <b>7</b><br />
iconv_strlen: 

In my case, I get the same error as reported: "iconv_strlen(): Detected an illegal character in input string in /vendor/squizlabs/php_codesniffer/src/Tokenizers/Tokenizer.php on line 193".
But it is triggered by a phpdoc containing accented characters like "éãóíç", etc.

Sorry for not having time to properly write a test case, nor create a PR to fix this, but I did a test simply using "mb_strlen" instead of "iconv_strlen" on squizlabs/php_codesniffer/src/Tokenizers/Tokenizer.php:193 and seemed to work fine.

I know this is not the actual fix, but at least this may lead to a solution.

Here are some other references just in case:

Also, I strongly suggest not suppressing errors with "@", since it makes the issue harder to identify.

I hope this helps.

@mourawaldson

It's also good to mention that when this issue happens, the code sniffer does not continue to check the rest of the file, so it just shows a warning on <?php tag then nothing else is evaluated, at least on PHPStorm.

@AndreiZiblitski

AndreiZiblitski commented Jul 10, 2017

I am having the same issue on PHPStorm on 3.0.1. =(
The bug appears if document has symbol 'copyright' => ©. In my case it is almost all php files.

@irudoy

irudoy commented Jul 18, 2017

Exactly the same issue
VS Code 1.15.0
PHP_CodeSniffer version 3.0.1

@gsherwood gsherwood added this to the 3.2.0 milestone Jul 20, 2017
@gsherwood
Member

Added this to the 3.2.0 milestone, but I still can't replicate it, so this is mainly a reminder to revisit it when I am working on that version.

If anyone is able to replicate while passing the correct encoding to PHPCS, please let me know what content is causing the error. If you are able to add the debug code I provided above, that would be very helpful as well.

@mourawaldson

Using CLI this is not visible.

Here's the test:

  • File content:
<?php
// é

  • Result:
Trying to get length for "<?php
" using encoding "utf-8"
Trying to get length for "// é
" using encoding "utf-8"

FILE: /var/www/projects/spartan_billing/test.php
----------------------------------------------------------------------
FOUND 1 ERROR AFFECTING 1 LINE
----------------------------------------------------------------------
 2 | ERROR | You must use "/**" style comments for a file comment
----------------------------------------------------------------------
  • What PHP Storm shows:
    screenshot 2017-07-20 02 33 33

Again, as I mentioned in my previous comment, this does not happen when using mb_strlen, and it does not seem to break anything else.
It is still not clear to me why this happens only in PHP Storm, but is there any specific reason to keep using iconv?

@davidfavor

This problem also occurs in 3.0.2 across many files, even if --encoding=utf-8 is passed.

Since mb_strlen() seems to fix the problem, be great if this fix could be rolled in.

In my case, I have several 100s of files where phpcs just dies with the iconv message. Trying to find the one offending character to fix... to... get past the code bailing would be a monumental task.
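One way to narrow down the offending lines is sketched below. It assumes the files are meant to be UTF-8 and only checks for that encoding; the function name `invalidUtf8Lines` is made up for this example and is not part of PHPCS. It relies on the PCRE `/u` modifier, which makes `preg_match()` return `false` when the subject is not valid UTF-8.

```php
<?php
/**
 * Hypothetical helper (not part of PHPCS): return the 1-based numbers of the
 * lines in $content that are not valid UTF-8.
 */
function invalidUtf8Lines(string $content): array
{
    $bad = [];
    foreach (explode("\n", $content) as $index => $line) {
        // With the /u modifier, PCRE validates the subject as UTF-8 first;
        // preg_match() returns false (not 0) when that validation fails.
        if (preg_match('//u', $line) !== 1) {
            $bad[] = $index + 1;
        }
    }

    return $bad;
}
```

For a single file this could be called as `print_r(invalidUtf8Lines(file_get_contents('test.php')));`. An empty result suggests the content itself is valid UTF-8, in which case the mismatch is more likely in the encoding being passed to PHPCS.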

@davidfavor

Making the mb_strlen change suggested by @mourawaldson works like a charm.

Be great if this minor code change (4 lines of code) could be rolled into the Tokenizer.php file + a new version of PHP_CodeSniffer released.

Also, please do remove the "@" shutup operator as @mourawaldson suggested also.

Thanks for your consideration of this fix.

@chybaDapi

PR #1611 will fix this issue. I have not found any additional issues with mb_strlen().

@gsherwood
Member

gsherwood commented Nov 15, 2017

I've been looking into this more and I think that using mb_strlen doesn't really fix anything. Yes, it will run without error if there is an encoding mismatch, but it still won't produce the correct length in this case. You may as well just call strlen and use what it returns, because both values will be wrong.

The reason the @ operator is there is to allow PHPCS to use iconv_strlen without having to worry about the E_NOTICE that is produced when you've got the encoding wrong. But I've added a custom error handler in version 3, which is why this error is now being reported where it was previously muted.

So my current thinking is that I'll just change the error reporting settings while calling iconv_strlen so that it reverts back to the previous behaviour where the value would ultimately come from strlen but no error would be shown.
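A minimal sketch of that approach follows. It is not the actual PHPCS code: the `tokenLength` helper and the error handler are hypothetical, and the sketch assumes the custom handler checks `error_reporting()` before acting, since PHP calls a user-defined handler regardless of the current reporting level.

```php
<?php
// Hypothetical stand-in for a custom error handler. The assumption here is
// that it honours the current error_reporting() level.
set_error_handler(function (int $code, string $message): bool {
    if ((error_reporting() & $code) === 0) {
        // This error level is currently muted, so swallow the error.
        return true;
    }

    throw new RuntimeException($message);
});

/**
 * Hypothetical helper: length of $content in characters for $encoding, or in
 * bytes when the content is not valid in that encoding.
 */
function tokenLength(string $content, string $encoding): int
{
    // error_reporting() returns the previous level when given a new one.
    $oldLevel = error_reporting(0);
    $length   = iconv_strlen($content, $encoding);
    error_reporting($oldLevel);

    if ($length === false) {
        // Invalid characters for this encoding: fall back to the byte count.
        $length = strlen($content);
    }

    return $length;
}
```

With this in place, a call such as `tokenLength("caf\xe9!", 'utf-8')` falls back to the byte count instead of surfacing the notice, and the previous reporting level is restored before returning.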

I still have no idea why PHP Storm is causing this issue while a CLI run is not. It feels like the content is either being saved in the wrong encoding or the wrong encoding is being passed to PHPCS (or no encoding is being passed). It's quite likely this has always failed in PHP Storm but the error was just being suppressed properly by the use of @ in version 2.x.

gsherwood added a commit that referenced this issue Nov 16, 2017
The error from iconv_strlen when a string contains invalid chars (based on the encoding) was no longer being muted due to the new error handler in the Runner class. This commit replaces the mute operator with an error_reporting change to properly mute that error again and allow files to be checked even with mixed encoding.
@gsherwood
Member

I've pushed the change I described in the previous comment. This should restore the previous behaviour from version 2.x where the iconv_strlen error is muted. I'll leave this in feedback for a little while in case anyone has some time to test it.

I've mentioned this in a few places, but not here, so: I'm not going to switch to using mb_strlen because that would mean a serious BC break for PHPCS due to new requirements. I would only consider a change like that in a major version (version 4) and only if it performed significantly better as iconv is a default extension and mb is not.

@davidfavor

davidfavor commented Nov 16, 2017

Changing iconv_strlen to mb_strlen fixes all these errors for me.

I've taken to manually patching Tokenizer.php every time it updates.

This is the best way to fix 100s + sometimes 1000s of phpcs bailouts, where processing stops + no reports are generated.

You can't really just mute the iconv_strlen related problems, because whenever one of these problems is hit, phpcs processing stops + errors out.

Wrapping iconv_strlen in an eval + ignoring exceptions raised will likely work.

@mourawaldson

Hi @gsherwood,

I understand your concern, but when you say "but it still won't produce the correct length in this case", what is the reason? Do you have some sample code that gives different results?

I'll try to take some time to test this out.

@gsherwood
Member

I understand your concern, but when you say "but it still won't produce the correct length in this case", what is the reason?

If the encoding you specify doesn't match the encoding of the string, you won't get a correct count. The iconv extension handles this by throwing an error. The mbstring extension handles this without errors (I can't remember if it ignores chars, or counts bytes, or both) but it can't actually give the correct result.

Here is some sample code with a string I was using for testing this:

<?php
$str = 'А а, Б б, В в';
echo 'strlen: '.strlen($str).PHP_EOL;
echo 'mb_strlen(utf-8): '.mb_strlen($str, 'utf-8').PHP_EOL;
echo 'mb_strlen(utf-16): '.mb_strlen($str, 'utf-16').PHP_EOL;
echo 'mb_strlen(windows-1252): '.mb_strlen($str, 'windows-1252').PHP_EOL;
echo 'iconv_strlen(utf-8): '.iconv_strlen($str, 'utf-8').PHP_EOL;
echo 'iconv_strlen(utf-16): '.iconv_strlen($str, 'utf-16').PHP_EOL;
echo 'iconv_strlen(windows-1252): '.iconv_strlen($str, 'windows-1252').PHP_EOL;

Which outputs:

strlen: 19
mb_strlen(utf-8): 13
mb_strlen(utf-16): 9
mb_strlen(windows-1252): 19
iconv_strlen(utf-8): 13
PHP Notice:  iconv_strlen(): Detected an incomplete multibyte character in input string in /Users/gsherwood/Sites/Projects/PHP_CodeSniffer/temp.php on line 8

Notice: iconv_strlen(): Detected an incomplete multibyte character in input string in /Users/gsherwood/Sites/Projects/PHP_CodeSniffer/temp.php on line 8
iconv_strlen(utf-16):
PHP Notice:  iconv_strlen(): Detected an illegal character in input string in /Users/gsherwood/Sites/Projects/PHP_CodeSniffer/temp.php on line 9

Notice: iconv_strlen(): Detected an illegal character in input string in /Users/gsherwood/Sites/Projects/PHP_CodeSniffer/temp.php on line 9
iconv_strlen(windows-1252):

The correct string length is 13 characters, which is what you get when you've passed in UTF-8. If I just subbed in mb_strlen for iconv_strlen, I'd still be running with incorrect values whenever the incorrect encoding has been passed in. That's obviously what's happening here with PHPStorm, because the CLI is working fine but iconv is still finding invalid chars when run via PHPStorm. The passed encoding must be wrong.

@gsherwood
Member

Changing iconv_strlen to mb_strlen fixes all these errors for me.

It works because mb_strlen won't produce errors. It is likely not producing the correct result though, unless you've configured it to always use the one encoding that you use, which PHPCS obviously can't do.

I've taken to manually patching Tokenizer.php every time it updates.

You might be better off testing the fix I committed first as this restores the PHPCS 2.x behaviour, which presumably worked for you.

This is the best way to fix 100s + sometimes 1000s of phpcs bailouts, where processing stops + no reports are generated.

You can't really just mute the iconv_strlen related problems, because whenever one of these problems is hit, phpcs processing stops + errors out.

Wrapping iconv_strlen in an eval + ignoring exceptions raised will likely work.

You should read my previous comment about why it was muted, why muting broke in version 3, and how I was going to fix it.

@mourawaldson

Thanks for sharing your test @gsherwood!

I wasn't aware that encodings other than UTF-8 gave different results.

Anyway, I gave your change a try here and it is working fine on PHPStorm.

I did quickly look at some benchmarks comparing iconv_strlen with mb_strlen, and mb performs way better, but as you said, it is not a simple change and the impact is actually big.

For me this is considered fixed by your changes.

Thanks again.

@davidfavor

If I pull a git copy of the project right now, let me know if your fix will be available in the pull.

@gsherwood
Member

If I pull a git copy of the project right now, let me know if your fix will be available in the pull.

It's been committed, so it will be there if you pull master.

@davidfavor

Thanks!
