Upload File With Any Charset


(Konapaz) #1

Hi

I am trying to upload files whose names can contain characters from any language (English, Greek, German, Chinese, etc.).

But the uploaded file ends up with a different name.

For example:

ελληνικοιχαρακτήρες.txt

Englishcharacters.txt

中国文字.txt

The upload succeeds only for the name with English characters.

If I use

$name_file = iconv('UTF-8', 'greek//TRANSLIT', $model->file->name);

$model->file->saveAs('documents/' . $name_file);

the English and Greek names succeed, but the Chinese one does not.

How can I make it work for all languages?

Thanks!


(Alirz23) #2

I would not get into that. I would simply create a mapping in the database that stores the original filename, use md5(filename) as the actual filename on disk, and use that for the lookup. PHP is not really good at Unicode at the moment, although it is getting better, and there are some filesystem issues as well.


(Konapaz) #3

Hi Alirz, thanks for your response.

In my case I have to organize the files and let users download them under their original names,

so what do you suggest in this case?


(Alirz23) #4

As I said, create a mapping in your uploads table:

fake_file_name // use this to store the files on filesystem

original_file_name // use this for links since you want your users to download the file with original name
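A minimal sketch of that two-column mapping, assuming the original name arrives as a UTF-8 string. The helper names fakeFileName() and uploadRow() are made up for illustration and are not part of PHP or Yii; the array stands in for a row of the uploads table.

```php
<?php
// Sketch only: fakeFileName() and uploadRow() are hypothetical helpers.

// Build an ASCII-only name for the file system from the original name.
function fakeFileName(string $originalName): string
{
    // '.' is a single ASCII byte, so a byte-wise search is safe for UTF-8 names.
    $dot = strrpos($originalName, '.');
    $ext = $dot === false ? '' : strtolower(substr($originalName, $dot + 1));
    // Keep the extension only if it is short and plain alphanumeric.
    $ext = preg_match('/^[a-z0-9]{1,10}$/', $ext) === 1 ? '.' . $ext : '';
    return md5($originalName) . $ext;
}

// Row for the uploads table: one name for the disk, one for display and links.
function uploadRow(string $originalName): array
{
    return [
        'fake_file_name'     => fakeFileName($originalName),
        'original_file_name' => $originalName,
    ];
}
```

The file would then be saved under fake_file_name (for example with saveAs('documents/' . $row['fake_file_name'])), while original_file_name is what gets shown to users. Note that hashing only the name means two uploads with the same name map to the same file on disk.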


(Konapaz) #5

Basically, users could have direct access to the folder (through HTTP or FTP). According to your suggestion, I would have to regenerate the whole folder structure with its files under their original names (using the mapping table), right?

Is there another way? The simplest would be to upload the file under the same name, but how?

As I said at the start of this topic, there are encoding issues…


(Alirz23) #6

As I said, you don’t have to restructure anything. It’s simple enough for anyone to make sense of.


(Konapaz) #7

Maybe I didn’t make it clear.

According to your suggestion:

original file name (used for the URL) --> file name on the server

ελληνικοιχαρακτήρες.txt --> daf003543b1c504e44342cdcce8af58a.txt

Englishcharacters.txt --> 8a48ad1d22cb6a456ec006284eefe36b.txt

中国文字.txt --> 8f30502091d569e6649d198cdfa8598a.txt

So the files on the server will be named with their md5 hash.

I know how to make a download link like this:

<a href='folder_a/8a48ad1d22cb6a456ec006284eefe36b.txt'>Englishcharacters.txt</a>

But I have to do that for each file.

As I said, I want the user to have direct access to folder_a (for example using FTP), and there the user can’t see the original file name, only the encoded one.

So how can I upload files and store them under their original file names?


(Tomasz) #8

This is quite normal and I don’t see anything wrong. With the first line of your example you told PHP to convert from Greek to UTF-8 (or the other way round, it doesn’t matter here), so it was able to properly handle only filenames with English (no special characters) and Greek characters. There is no way this could handle Chinese or any other special characters (Polish, for example). I would be very surprised if that worked.

If you used the same line but gave a Chinese charset as the second parameter for iconv, your script would handle English and Chinese characters without any problems, but would fail on Greek and any other. This seems logical (at least to me).
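A small sketch of why a single-language target charset cannot cover every name: it converts a UTF-8 name to the target charset and back, and reports whether the name survived. The function name fitsCharset() is made up for illustration, and ISO-8859-7 is assumed here as the Greek charset (the first post used the alias 'greek').

```php
<?php
// Sketch only: fitsCharset() is a hypothetical helper.
// Returns true if every character of the UTF-8 name exists in $charset.
function fitsCharset(string $utf8Name, string $charset): bool
{
    // iconv() reports unconvertible characters with a notice and false;
    // some implementations substitute a placeholder character instead.
    $converted = @iconv('UTF-8', $charset, $utf8Name);
    if ($converted === false) {
        return false;
    }
    // The round trip catches implementations that substitute silently.
    return @iconv($charset, 'UTF-8', $converted) === $utf8Name;
}

$greek   = "\u{03B1}\u{03B2}\u{03B3}.txt"; // three Greek letters
$chinese = "\u{4E2D}\u{6587}.txt";         // two Chinese characters

var_dump(fitsCharset($greek, 'ISO-8859-7'));
var_dump(fitsCharset($chinese, 'ISO-8859-7'));
```

English names pass for almost any target charset because ASCII is a subset of most of them, which is why only the English and Greek files came through.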

First of all, storing files on a server with names in any character set other than English is a hellish idea and complete madness! Your PHP and Apache support UTF-8, but your file system certainly does not. You’ll end up with doubled files, files with incorrect filenames, files that can’t be downloaded, and so on. You’re asking for real trouble. Are you bored and looking for some challenges? :]

Even if you can be sure that your server’s (Linux?) file system is 100% UTF-8 ready and can write UTF-8 encoded filenames, HTTP upload will cause another large bunch of troubles (a piece of which you have already tasted) if you attempt to transfer files with non-English characters in their names.

If you would like to support all languages with the above (iconv) method, you would have to:

  • find a way to determine in which language or alphabet the filename is written (is it possible at all?),
  • transfer this language setting along with the transmitted file,
  • set the second iconv parameter according to the transmitted value of the detected language.

This is madness, let me underline this again. For example, pilot is a valid word in English, Polish and probably many other languages. The same goes for stop. You can name hundreds of such examples. How are you going to detect the language of a filename correctly in this case? Take some time and test Google Translate with the language autodetection option enabled to see how often it makes mistakes.

You can ask the user to set the language of his file’s filename (using some combobox, for example). But what if he makes a wrong selection? This is even bigger madness.

My advice: the only reasonable solution here is to store filenames in English only and abort the file transfer if you detect that the filename contains non-English (non-Latin, actually) characters.

If you really need to support non-Latin filenames and there is no other way (kill the project manager, tell the customer that implementing this will cost a million dollars and hire someone to write a new PHP for you), you could maybe consider letting users transfer files directly via FTP (no HTTP file upload) and somehow binding the files transferred this way to your application. This is also madness. Take some time and test Total Commander (which has quite a good FTP client on board) to see how many times it goes wacko if you try to upload or download any file with a non-Latin filename.

I had so many problems with the simple French “e” (they’ve got five different ones there, with and without accents pointing left, right, etc.), which made my server go completely wacko, generate two separate files (one with the accented “e” and one with the plain “e”) and do a lot more stupid things. I said to myself then: no f*ing way! Hell is going to freeze over before I let users upload files with non-Latin characters in filenames.


(Konapaz) #9

Hi Trejder,

I thought it would be easier than that! The database (using utf8_general_ci) is compatible with any character.

It seems the file system is much more complex (depending on the Linux or Windows operating system, URLs, etc.).

If the server’s file system is in UTF-8, does it still have the same problems?

I just tested it with my Gmail by attaching files with non-Latin characters in their names. The files download with the correct (original) filenames, but the URLs do not point directly to a path on the server (an HTTP header is used).

So the solution is to use a map between the original filenames and the stored (encoded) files?
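A rough sketch of the header-based approach mentioned above: the file is stored under its encoded name, and the original name is sent in a Content-Disposition header at download time. The function name contentDisposition() is made up for illustration; the filename* form follows RFC 6266 / RFC 5987, and how individual browsers and older clients treat it is not verified here.

```php
<?php
// Sketch only: contentDisposition() is a hypothetical helper.
// Builds a Content-Disposition value carrying the original UTF-8 name.
function contentDisposition(string $originalName): string
{
    // ASCII fallback: every byte outside printable ASCII, plus quotes
    // and backslashes, is replaced by "_".
    $fallback = preg_replace('/[^\x20-\x7E]|["\\\\]/', '_', $originalName);

    // filename* carries the real name, percent-encoded as UTF-8.
    return 'attachment; filename="' . $fallback . '"'
        . "; filename*=UTF-8''" . rawurlencode($originalName);
}
```

A download script would look up the row by the encoded name, call header('Content-Disposition: ' . contentDisposition($row['original_file_name'])), and then stream the stored file, for example with readfile().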

Thanks

PS: Excellent explanation ;)

I give you a vote


(Tomasz) #10

I can’t tell you! For me, even thinking about this causes a headache! :]

First of all, this is Google. This looks like the mapping alirz23 has been talking about. But it may only look like mapping, while in fact it can have nothing to do with it. The entire file you attach could be stored in a database, with nothing on the server’s file system. They have endless free space and insanely big databases.

Second of all, this is Google. They store, measure, count, track and write down everything. And they have special tools for that, so their upload tool is most likely something sophisticated and you can’t compare anything to it.

Third of all, this is Google. I don’t like them. Which doesn’t change the fact, that I can’t live without their tools. World is sad! :[

Thanks. I have a good day. Me 2! :}


(Softark) #11

The iconv call in the first post is trying to convert UTF-8 to Greek. So it should fail for Chinese, as Trejder says.

But do you have to convert the file name to Greek? Is yours a Windows server?

And do we have to convert the file name at all when the server is Linux using UTF-8 for its file system?

I think the 'name' value in $_FILES should already be in UTF-8 when the web page’s encoding is UTF-8.

So I guess the conversion is not required in this case.

Um, I’m not sure. Sorry if I’m wrong. Actually, I have never tried to save uploaded files under their original names.
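If the name really does arrive as UTF-8 and the file system accepts UTF-8 names (both are assumptions here, not something tested in this thread), the conversion step could be replaced by a plain validity check. A minimal sketch, with cleanUtf8Name() as a made-up helper name:

```php
<?php
// Sketch only: cleanUtf8Name() is a hypothetical helper.
// Returns a bare file name if it is usable as-is, or null otherwise.
function cleanUtf8Name(string $clientName): ?string
{
    // Drop any directory part the client may have sent.
    // "/" and "\" are single ASCII bytes, so this is safe for UTF-8.
    $name = preg_replace('#^.*[/\\\\]#s', '', $clientName);

    if ($name === '' || $name === '.' || $name === '..') {
        return null;
    }
    // With the /u modifier, preg_match() fails on invalid UTF-8 input.
    if (preg_match('//u', $name) !== 1) {
        return null;
    }
    // Reject control characters, including NUL.
    if (preg_match('/[\x00-\x1F\x7F]/', $name) === 1) {
        return null;
    }
    return $name;
}
```

This only checks the string itself. Whether the name is then written correctly to disk still depends on the operating system, the file system and the PHP version.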


(Konapaz) #12

Hi softark

Sure, using Greek it cannot store Chinese or any other non-English characters (I posted it as an example).

I tested it on WampServer (on Windows), but I don’t know what will happen on a Linux server.

I tried to find a solution using iconv and autodetecting the file system encoding from PHP, but without luck (I want it to work on different operating systems).

The page is encoded in UTF-8, but the server may not be. So maybe a solution is something like this:

iconv('UTF-8', ANY_FUNCTION_AUTO_DETECT_SYSTEM_ENCODE_FILE, 'the_uploaded_name_file') …


(Softark) #13

I’ve recalled why I myself went the way alirz23 has suggested.

The biggest con of using the original file names for saving is that a user can easily overwrite an existing file with the uploaded one because of a file name collision.

[EDIT]

After all, it seemed to me that saving files under unique names (using a hash, timestamp, serial number or something else) and storing the original file names in the file management database would be simpler, more robust and more flexible.
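One possible way to get such unique names, sketched with random bytes rather than a hash of the original name so that two uploads with the same name never collide. The function name uniqueStoredName() is made up for illustration:

```php
<?php
// Sketch only: uniqueStoredName() is a hypothetical helper.
// Returns a random 32-character hex name, plus an optional extension.
function uniqueStoredName(string $extension = ''): string
{
    // Accept only a short, lowercase alphanumeric extension.
    $ext = preg_match('/^[a-z0-9]{1,10}$/', $extension) === 1
        ? '.' . $extension
        : '';

    // random_bytes() is available from PHP 7.0.
    return bin2hex(random_bytes(16)) . $ext;
}
```

The returned name goes to the file system; the original name is stored next to it in the database, as in the mapping described earlier in the thread.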


(Alirz23) #14

It is very difficult to implement. Even if you get it working, there are going to be other problems too: the filesystem for one, your PHP, Apache, headers … this is a nightmare. I would rather write a client to handle the file downloads and uploads and just avoid FTP completely.


(Konapaz) #15

Guys,

You convinced me!!

I will make a database table that maps the related file names.

I will also create an HTTP service that imitates FTP, only for viewing and downloading files.

Thanks, all of you! :)