recoding a buffer coding system

Discussion:

Santiago Mejia

2009-08-14 21:31:46 UTC

I am trying to make a script that downloads a webpage and reformats it
into enriched-mode.

I have managed to successfully reformat the raw html tags into
enriched-mode tags. However, when I try to display the buffer in
enriched-mode (by using (format-decode-buffer 'text/enriched)), some of
the non-ASCII characters get screwed up (German umlauts, to be precise).

I have managed to solve the issue through a nasty trick: saving the
file, killing the buffer, and reopening the file. But this is a trick
that I would like to avoid.

Any ideas as to how to do it?

Santiago.

Eli Zaretskii

2009-08-15 06:36:32 UTC

Post by Santiago Mejia
I have managed to successfully reformat the raw html tags into
enriched-mode tags. However, when I try to display the buffer in
enriched-mode (by using (format-decode-buffer 'text/enriched)), some of
the non-ASCII characters get screwed up (German umlauts, to be precise).

Get screwed how, exactly?

Post by Santiago Mejia
I have managed to solve the issue through a nasty trick: saving the
file, killing the buffer, and reopening the file. But this is a trick
that I would like to avoid.

You didn't say how reopening the file helps you avoid the problem.
Knowing that might suggest a way to avoid the trick.

Also, can you post a minimal file and some minimal code to reproduce
the problem? There could be a bug somewhere.

Finally, what version of Emacs is that?

Peter Dyballa

2009-08-15 08:26:13 UTC

Post by Santiago Mejia
I have managed to successfully reformat the raw html tags into
enriched-mode tags. However, when I try to display the buffer in
enriched-mode (by using (format-decode-buffer 'text/enriched)), some of
the non-ASCII characters get screwed up (German umlauts, to be precise).

How? What do they look like/how are they presented to you? Have you
looked at the mode-line and which encoding it displays for the
buffers? Does the downloaded file have a header in which an encoding
is listed?

Your eMail user agent runs in GNU Emacs 22.1 – are you using it also
for the purpose of downloading and reformatting HTML? GNU Emacs 23.1,
the Unicode Emacs, might give better results...

--
Greetings

Pete

By filing this bug report you have challenged the honor of my family.
Prepare to die!

Santiago Mejia

2009-08-15 14:31:40 UTC

Post by Eli Zaretskii
Also, can you post a minimal file and some minimal code to reproduce
the problem? There could be a bug somewhere.

Sorry for the lack of detail.

The method I am using to download the page is:

(switch-to-buffer (url-retrieve-synchronously "http://www.wordreference.com/deen/grun"))

In the buffer *http www:wordreference.com:80* I see the character that
firefox displays as "ü" (u with umlaut) as \303\274. When I try to copy
and paste it here in this e-mail, however, it appears as: "Ã¼" (that is
also what happens when I return this buffer's contents with
(buffer-string) and insert the result into another buffer).

As I said, however, if I merely save and reopen the file, the characters
get shown properly.

In case this is useful, in the buffer *http www:wordreference.com:80*
the variable 'buffer-file-coding-system' is mule-utf-8

And yes: I am using Emacs 22.1.1.

Santiago.

Peter Dyballa

2009-08-15 15:15:01 UTC

Post by Santiago Mejia
In the buffer *http www:wordreference.com:80* I see the character that
firefox displays as "ü" (u with umlaut) as \303\274.

LATIN SMALL LETTER U WITH DIAERESIS is U+00FC. It is saved as C3 BC
(hex) or \303 \274. So you get a correct byte representation.

Post by Santiago Mejia
When I try to copy
and paste it here in this e-mail, however, it appears as: "Ã¼"

Because LATIN CAPITAL LETTER A WITH TILDE is U+00C3 and VULGAR
FRACTION ONE QUARTER is U+00BC, and these two bytes are presented as
if they belonged to some ISO Latin encoding.
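The byte arithmetic above can be checked from Lisp. This is only a minimal sketch, assuming a current Emacs; `my/umlaut-readings` is a hypothetical name that does not appear in the thread. It encodes U+00FC as UTF-8 and then decodes the same two bytes under two different coding systems.

```elisp
;; Hypothetical helper (not from the thread): one byte pair, two readings.
(defun my/umlaut-readings ()
  "Return the UTF-8 bytes of U+00FC and two ways of decoding them."
  (let ((bytes (encode-coding-string (string #xFC) 'utf-8)))
    (list (string-to-list bytes)                      ; the raw byte values
          (decode-coding-string bytes 'utf-8)         ; read as UTF-8
          (decode-coding-string bytes 'iso-8859-1)))) ; read as Latin-1
```

Evaluating `(my/umlaut-readings)` should give the bytes 195 and 188 (C3 and BC in hex), then the single character "ü", then the two-character string "Ã¼" that showed up in the e-mail.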

Post by Santiago Mejia
As I said, however, if I merely save and reopen the file, the characters
get shown properly.

Yes, GNU Emacs now interprets the two bytes as one Unicode character.

Post by Santiago Mejia
In case this is useful, in the buffer *http www:wordreference.com:80*
the variable 'buffer-file-coding-system' is mule-utf-8

In the end? When you re-open a second time?

The problem probably is that url-retrieve-synchronously fetches a
byte stream which is fed into a 7-bit (?) encoding buffer, so Unicode
encoded characters end up as two (or more) bytes which are displayed in
octal because their character codes are inappropriate for this encoding.

Me, working in GNU Emacs 23.1.50 and 22.3, see no octal codes, I only
see the bytes from the UTF-8 encoded umlauts etc. according to HTML
property "charset=utf-8." The buffer actually has no encoding at all,
and so you're lucky that its contents are saved as UTF-8! Therefore
no information is lost and obviously GNU Emacs uses the proper
encoding when it opens the *file* now.

Maybe using

(modify-coding-system-alist 'process "<some thing>" 'utf-8)

makes GNU Emacs handle the buffer, associated with no file and with
no process, more like it should... I haven't found the proper setting!

--
Greetings

Pete

Time is an illusion. Lunchtime, doubly so.

Eli Zaretskii

2009-08-15 15:24:33 UTC

Post by Santiago Mejia
(switch-to-buffer (url-retrieve-synchronously "http://www.wordreference.com/deen/grun"))

In the buffer *http www:wordreference.com:80* I see the character that
firefox displays as "ü" (u with umlaut) as \303\274.

\303\274 is the UTF-8 representation of ü. I'm guessing that the
buffer where it is displayed as \303\274 is a unibyte buffer.

Post by Santiago Mejia
As I said, however, if I merely save and reopen the file, the characters
get shown properly.

Does it help to say "M-: (set-buffer-multibyte t) RET"?
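Whether a buffer is unibyte can be inspected directly. The sketch below is an illustration only, with `my/buffer-byte-report` as a hypothetical helper name; it reads the buffer-local variable enable-multibyte-characters and compares the character count with the byte count.

```elisp
;; Hypothetical helper (not from the thread): does BUFFER hold bytes or characters?
(defun my/buffer-byte-report (buffer)
  "Return a list (MULTIBYTE-P CHARS BYTES) describing BUFFER."
  (with-current-buffer buffer
    (list enable-multibyte-characters
          ;; Number of characters in the accessible portion.
          (- (point-max) (point-min))
          ;; Number of bytes those characters occupy internally.
          (- (position-bytes (point-max))
             (position-bytes (point-min))))))
```

In a unibyte buffer the first element is nil and the two counts are equal, because every byte counts as one character; that is the situation in which \303\274 is shown as two octal escapes. In a multibyte buffer holding a decoded "ü", the byte count exceeds the character count.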

Santiago Mejia

2009-08-16 02:29:09 UTC

Post by Peter Dyballa

Post by Santiago Mejia
In the buffer *http www:wordreference.com:80* I see the character that
firefox displays as "ü" (u with umlaut) as \303\274.
When I try to copy
and paste it here in this e-mail, however, it appears as: "Ã¼"
As I said, however, if I merely save and reopen the file, the
characters get shown properly.

Me, working in GNU Emacs 23.1.50 and 22.3, see no octal codes, I only
see the bytes from the UTF-8 encoded umlauts etc. according to HTML
property "charset=utf-8." The buffer actually has no encoding at all,
and so you're lucky that its contents are saved as UTF-8! Therefore
no information is lost and obviously GNU Emacs uses the proper
encoding when it opens the *file* now.

This is strange. I just installed emacs 23.0.60.1 (the emacs23 that
comes with Ubuntu --called emacs-snapshot) and I find the same exact
result: I still see the same \303\274 character for ü when I call:

(switch-to-buffer (url-retrieve-synchronously "http://www.wordreference.com/deen/grun"))

Post by Peter Dyballa

Post by Santiago Mejia
In case this is useful, in the buffer *http www:wordreference.com:80*
the variable 'buffer-file-coding-system' is mule-utf-8

In the end? When you re-open a second time?

No. In the beginning, before saving (Actually, I save and re-open the file with a
different name). When I re-open the file, buffer-file-coding-system is
utf-8-unix.

Post by Peter Dyballa
Maybe using
(modify-coding-system-alist 'process "<some thing>" 'utf-8)
makes GNU Emacs handle the buffer, associated with no file and with
no process, more like it should... I haven't found the proper
setting!

I will try to use your suggestion, but this will entail going through
the documentation and trying to understand it. This weekend,
unfortunately, I will not have the time to do so.

Any further help is appreciated.

Santiago.

Santiago Mejia

2009-08-16 02:33:55 UTC

Post by Eli Zaretskii

Post by Santiago Mejia
As I said, however, if I merely save and reopen the file, the characters
get shown properly.

Does it help to say "M-: (set-buffer-multibyte t) RET"?

No. Nothing happens when I call this function.

Any further ideas?

Peter Dyballa

2009-08-16 02:55:48 UTC

Post by Santiago Mejia
I just installed emacs 23.0.60.1 (the emacs23 that
comes with Ubuntu --called emacs-snapshot) and I find the same exact

Me too, in GNU Emacs 23.1.50. Maybe the function comes from a world
of 7-bit US ASCII only...

--
With peaceful regards

Pete

Competition is the great eroder of profits.

Eli Zaretskii

2009-08-16 03:17:07 UTC

Post by Peter Dyballa

Post by Santiago Mejia
I just installed emacs 23.0.60.1 (the emacs23 that
comes with Ubuntu --called emacs-snapshot) and I find the same exact
result: I still see the same \303\274 character for ü when I call:

Me too, in GNU Emacs 23.1.50. Maybe the function comes from a world
of 7-bit US ASCII only...

Sounds like a bug that should be reported.

Santiago Mejia

2009-08-16 13:49:25 UTC

Post by Eli Zaretskii

Post by Peter Dyballa

Post by Santiago Mejia
I just installed emacs 23.0.60.1 (the emacs23 that
comes with Ubuntu --called emacs-snapshot) and I find the same exact

Me too, in GNU Emacs 23.1.50. Maybe the function comes from a world
of 7-bit US ASCII only...

Sounds like a bug that should be reported.

Probably there is a bug... however, there is something that emacs is
doing right in the process of writing and re-opening the file.

I tried debugging my program, by going step by step through the
(write-file "foo") and (insert-file-contents "foo") functions, to see if
I could figure out where the conjuring trick was done. However, I did
not quite find it (that is why I appealed to the list).

Any ideas as to what I should look for in debugging these functions?
(perhaps which functions emacs is likely using that I could
call from emacs itself, so as not to have to save and re-open?)

Is my best bet to look at (write-file "foo") or at
(insert-file-contents "foo")?

S.

Eli Zaretskii

2009-08-16 17:06:53 UTC

Post by Santiago Mejia
I tried debugging my program, by going step by step through the
(write-file "foo") and (insert-file-contents "foo") functions, to see if
I could figure out where the conjuring trick was done. However, I did
not quite find it (that is why I appealed to the list).

I'm quite sure it works because insert-file-contents decodes the UTF-8
sequences into Unicode characters. That's not a bug, but the correct
behavior.

The bug seems to be in url-retrieve-synchronously, so you might as
well stop looking at write-file and insert-file-contents.
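If the decoding done by insert-file-contents is what makes the round trip work, the same step can be applied in memory. What follows is a hedged sketch, not a tested fix for the code in this thread: `my/decode-body` is a hypothetical name, and the sketch assumes the raw buffer really contains UTF-8 bytes.

```elisp
;; Hypothetical helper (not from the thread): decode raw bytes without a file.
(defun my/decode-body (buffer coding)
  "Return the text of BUFFER decoded with CODING, leaving BUFFER untouched."
  (with-current-buffer buffer
    (decode-coding-string
     (buffer-substring-no-properties (point-min) (point-max))
     coding)))
```

A call such as `(insert (my/decode-body raw-buffer 'utf-8))`, made in an ordinary multibyte buffer, would then take the place of saving and re-opening; `raw-buffer` stands for whatever url-retrieve-synchronously returned. Note that such a buffer also holds the HTTP response headers, which this sketch does not strip.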

Peter Dyballa

2009-08-16 21:09:44 UTC

Post by Eli Zaretskii
Sounds like a bug that should be reported.

I think there is no bug in url-retrieve-synchronously! This function
needs to be kind of universal, i.e., it should not assume or set anything.
From the internet one can download anything: 7-bit US-ASCII, 8-bit
umlauts, Unicode, and real binary data (PDF, JPEG, MPEG,...). It
would be nice if this function accepted another argument, the
encoding for the buffer created. Right now the user has to take care
of this, because the user knows what kind of "data" will be (or
already was) downloaded. The variables save-buffer-coding-system or
buffer-file-coding-system determine how the buffer will be saved in a
file. And this should suffice...
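Since the caller has to choose the encoding, one option is to read it from the charset declaration mentioned earlier in the thread. This is a rough sketch under that assumption; `my/charset-from-text` is a hypothetical name, and real pages may declare their charset in ways this simple pattern misses.

```elisp
;; Hypothetical helper (not from the thread): find a declared charset.
(defun my/charset-from-text (text)
  "Return the coding system named by a charset=... declaration in TEXT.
Return nil if there is no declaration or Emacs does not know the name."
  (when (string-match "charset=[\"']?\\([-A-Za-z0-9_]+\\)" text)
    (let ((sym (intern (downcase (match-string 1 text)))))
      ;; Only accept names that are real coding systems in this Emacs.
      (and (coding-system-p sym) sym))))
```

For a header line such as "Content-Type: text/html; charset=UTF-8" this returns the symbol utf-8, which could then be handed to a decoding function; for binary data with no declaration it returns nil, and the bytes are best left alone.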

--
Greetings

Pete

Theory and practice are the same, in theory, but, in practice, they
are different.