Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seeing regular files with special characters in it #13279

Open
alexanderadam opened this issue Oct 23, 2020 · 18 comments
Open

Seeing regular files with special characters in it #13279

alexanderadam opened this issue Oct 23, 2020 · 18 comments
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented issue/stale

Comments

@alexanderadam
Copy link

alexanderadam commented Oct 23, 2020

  • Gitea version (or commit ref): 1.14.0

Description

I cannot see a file if it contains special chars. Is there any way to enforce watching the file anyway?
See this repo for instance.

The file contains this

# frozen_string_literal: true

puts 'Foo�bar!'

And it makes sense that gitea thinks, that this could be a binary file because of the . But it obviously isn't a binary file and I would love to see the file anyway. Instead of having to download it.

Funnily enough, the commit view shows the content anyway. So something is inconsistent here.

Screenshots

Screenshot from 2020-10-23 12-55-05

@lunny
Copy link
Member

lunny commented Oct 23, 2020

What character does the filename contain?

@alexanderadam
Copy link
Author

alexanderadam commented Oct 23, 2020

The file contains this

# frozen_string_literal: true

puts 'Foo�bar!'

and the character causing this is "".

You can also inspect / checkout the example repo if it helps.

@silverwind
Copy link
Member

silverwind commented Oct 23, 2020

That's the unicode replacement character which essentially should be considered as text, not binary. How does GitHub handle this case?

@alexanderadam
Copy link
Author

alexanderadam commented Oct 23, 2020

That's the unicode replacement character which essntially should be considered as text, not binary.

I pasted the exact character. GitHub is only showing the Unicode replacement character (which is IMHO also the best way of handling this). You can simply try it out by yourself:

  1. Go to the Gitea commit view
  2. Copy the text between the '
  3. Paste it here in a comment field on GitHub
  4. See how it is rendered by switching to the Preview tab

GitHub's rendering of unrenderable characters

@zeripath
Copy link
Contributor

@alexanderadam
Copy link
Author

https://github.com/zeripath/pathological/blob/be-broken/regular_text_file.rb is how it appears on github.com

So similar to the commit view of Gitea (when you're copying the test, you're getting the 'correct' character as well).
Thus, if the state of Gitea should be on par with the view of GitHub, the 'regular Gitea view' must be fixed, right?

@mrsdizzie
Copy link
Member

mrsdizzie commented Oct 23, 2020

I this case it is because http.DetectContentType returns application/octet-stream for this example:

return strings.Contains(http.DetectContentType(data), "text/")

https://golang.org/pkg/net/http/#DetectContentType

Thats also what my system returns locally when checking out the repo...maybe we need a different test for text then though this file is telling everybody else "im not a text file" so ...

@silverwind
Copy link
Member

Sounds lika a golang bug if � is detected as binary.

https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type

@alexanderadam
Copy link
Author

It's an entirely different issue and yet related:

It the comments Gitea doesn't render the character like GitHub does ().
Should it it stay this way or adapted to GitHub's logic (or to put it differently: should I open an issue for that)?

@mrsdizzie
Copy link
Member

Comments seem OK to me (I left an example that looks fine) and I suspect that is just another copy/paste issue or something.

To focus on this issue: It isn't just golang that detects mime type like this, the file command seems to also. So if there is a bug its in the general spec of mime type signatures maybe.

But I bet Github just doesn't use mime type detection for text files for these reasons.

We could instead maybe do something like this:

// IsTextFile returns true if file content format is plain text or empty.
func IsTextFile(data []byte) bool {
   if len(data) == 0 {
   	return true
   }
   return utf8.Valid(data)
}

Which seems to work in a few simple tests including this example

@silverwind
Copy link
Member

utf8.Valid sounds like a good idea for binary detection. That linked spec only does it on the first 1445 bytes and I generally think we should limit that parsing for performance reasons.

@zeripath
Copy link
Contributor

Well 0x1e is a control character - I think it's worth noting that well behaved documents should not have. I mean it's invalid in XML.

@alexanderadam
Copy link
Author

I mean it's invalid in XML.

Also every Go code is invalid XML and yet, Gitea is able to present Go code properly, right?
I'm not sure how we ended up that XML is the reference here. 😉
Other source code obviously can contain different kinds of control characters.

Furthermore GitHub, which seem to be a reference for many things in Gitea, is handling it properly, too.

I absolutely agree that it's an edge case but I'm really curious how we got to XML conformity here. 🤔

@zeripath
Copy link
Contributor

Have you considered that that the fact that it is invalid in w3c text formats like XML, HTML etc might be the reason why the content type is detected as binary...?

@alexanderadam
Copy link
Author

So the current check whether a file is a binary file is checking whether the file would be XML?
Again: this would exclude many code files from other languages, since Go or C code isn't a W3C format.

I mean, guessing file types is definitely difficult. A complicated content check could sometimes be worse than just mapping the file extension to a mime type. And even going for the magic bytes can lead to false results.

Or do you want to imply that the browser decides whether the View Raw view is shown instead the file contents? 🤔

@zeripath
Copy link
Contributor

No I'm explaining that the reason why detectcontenttype is saying that the file is binary is because the character is not allowed in web text formats. (Even if browsers don't just barf on them they're supposed to be escaped in html etc.)

How do you propose we detect binary formats?

I ask in all honesty because the technique used in git itself is pretty horrible and fatally flawed - IIRC it's simply does the file contain a NUL (0x0) character within the first 1kb. This is why git handles UTF-16 formats badly.

The next problem we face is the detect content encoding problem. Because people persist in committing documents not in UTF-8 and in encodings like CP-1252 we need to detect these to show them. In general encoding detection libraries do not expect to see non CR or LF control characters and will assign a very low likelihood to any encoding.

Then once you've got those sorted you'll need to escape the control characters as they are absolutely not supposed to be in html documents - (I don't know whether highlighter is doing the correct thing above) - they should therefore be replaced.

Dealing with text encodings is not easy. It is subtle and full of problems. The utf8.Valid solution noted above would fail non-utf8 character encodings.

@alexanderadam
Copy link
Author

Thank you Zeri, your last comment indeed explains your thoughts pretty good. 👍

@stale
Copy link

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.

@stale stale bot added the issue/stale label Dec 25, 2020
@lunny lunny added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Dec 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented issue/stale
Projects
None yet
Development

No branches or pull requests

5 participants