Seeing regular files with special characters in it #13279
Comments
What character does the filename contain? |
The file contains this:
# frozen_string_literal: true
puts 'Foo�bar!'
and the character causing this is " You can also inspect / check out the example repo if it helps. |
That's the Unicode replacement character, which essentially should be considered text, not binary. How does GitHub handle this case? |
I pasted the exact character. GitHub is only showing the Unicode replacement character (which is IMHO also the best way of handling this). You can simply try it out by yourself:
|
https://github.com/zeripath/pathological/blob/be-broken/regular_text_file.rb is how it appears on github.com |
So similar to the commit view of Gitea (when you're copying the text, you're getting the 'correct' character as well). |
In this case it is because http.DetectContentType returns application/octet-stream for this example: Line 357 in bfc5531
https://golang.org/pkg/net/http/#DetectContentType That's also what my system returns locally when checking out the repo... Maybe we need a different test for text then, though this file is telling everybody else "I'm not a text file", so ... |
Sounds like a golang bug if � is detected as binary. https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type |
It's an entirely different issue and yet related: in the comments, Gitea doesn't render the character like GitHub does ( |
Comments seem OK to me (I left an example that looks fine), and I suspect that is just another copy/paste issue or something. To focus on this issue: it isn't just golang that detects the MIME type like this; the file command seems to as well. So if there is a bug, it is maybe in the general spec of MIME type signatures. But I bet GitHub just doesn't use MIME type detection for text files for these reasons. We could instead maybe do something like this:
// Requires: import "unicode/utf8"

// IsTextFile returns true if file content format is plain text or empty.
func IsTextFile(data []byte) bool {
	if len(data) == 0 {
		return true
	}
	return utf8.Valid(data)
}
Which seems to work in a few simple tests, including this example. |
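As a rough check of the proposal above, the following sketch runs IsTextFile against three inputs: the line from this issue (with an assumed 0x1e byte), the PNG magic bytes, and an empty slice. The input names are made up for this illustration, and this is not a test from the Gitea code base.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// IsTextFile is the helper proposed above: empty or valid UTF-8 counts as text.
func IsTextFile(data []byte) bool {
	if len(data) == 0 {
		return true
	}
	return utf8.Valid(data)
}

func main() {
	// 0x1e is a single-byte code point, so the data is still valid UTF-8.
	withRS := []byte("puts 'Foo\x1ebar!'\n")
	// 0x89 cannot start a UTF-8 sequence, so this is rejected.
	pngMagic := []byte{0x89, 'P', 'N', 'G', 0x0d, 0x0a, 0x1a, 0x0a}

	fmt.Println(IsTextFile(withRS))   // true
	fmt.Println(IsTextFile(pngMagic)) // false
	fmt.Println(IsTextFile(nil))      // true
}
```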
|
Well, 0x1e is a control character, and I think it's worth noting that well-behaved documents should not contain one. I mean, it's invalid in XML. |
Also, all Go code is invalid XML, and yet Gitea is able to present Go code properly, right? Furthermore, GitHub, which seems to be a reference for many things in Gitea, handles it properly too. I absolutely agree that it's an edge case, but I'm really curious how we got to XML conformity here. 🤔 |
Have you considered that the fact that it is invalid in W3C text formats like XML, HTML etc. might be the reason why the content type is detected as binary...? |
So the current check whether a file is a binary file is checking whether the file would be XML? I mean, guessing file types is definitely difficult. A complicated content check could sometimes be worse than just mapping the file extension to a mime type. And even going for the magic bytes can lead to false results. Or do you want to imply that the browser decides whether the |
No, I'm explaining that the reason http.DetectContentType says the file is binary is that the character is not allowed in web text formats. (Even if browsers don't just barf on them, they're supposed to be escaped in HTML etc.) How do you propose we detect binary formats? I ask in all honesty, because the technique used in git itself is pretty horrible and fatally flawed: IIRC it simply checks whether the file contains a NUL (0x0) character within the first 1kb. This is why git handles UTF-16 formats badly. The next problem we face is detecting the content encoding. Because people persist in committing documents that are not in UTF-8 but in encodings like CP-1252, we need to detect these to show them. In general, encoding detection libraries do not expect to see control characters other than CR or LF and will assign a very low likelihood to any encoding. Then, once you've got those sorted, you'll need to escape the control characters, as they are absolutely not supposed to be in HTML documents (I don't know whether the highlighter is doing the correct thing above); they should therefore be replaced. Dealing with text encodings is not easy. It is subtle and full of problems. The utf8.Valid solution noted above would fail for non-UTF-8 character encodings. |
Thank you Zeri, your last comment indeed explains your thoughts pretty well. 👍 |
This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions. |
Description
I cannot view a file if it contains special characters. Is there any way to force the file to be displayed anyway?
See this repo for instance.
The file contains this
And it makes sense that Gitea thinks this could be a binary file because of the �. But it obviously isn't a binary file, and I would love to see the file anyway instead of having to download it. Funnily enough, the commit view shows the content anyway. So something is inconsistent here.
Screenshots