text encoding problem with html bitstreams

Petya Kohts
Hello,

following up on the discussion

https://sourceforge.net/mailarchive/message.php?msg_id=31215737
http://web.archiveorange.com/archive/v/hxciqsTWLSVu2rG3JE47

started by James Leonard Halliday with the subject
"text encoding problem with bitstreams in DSpace 3.1 - resolved":

> Hi everyone,
>
> I posted about this a while back, and finally found a workaround
> so I wanted to share. My problem was regarding HTML bitstreams
> in DSpace 3.1 (XMLUI).
>
> In previous versions of DSpace, the encoding for my UTF-8 bitstreams
> worked just fine, but in DSpace 3.1, the encoding for ONLY the bitstreams
> was coming out as ISO-8859 instead. After much searching, I finally found
> a workaround.

First of all, thanks for sharing, Leonard!


DSpace 4.0 rc3 has the same problem, and the same fix helps.

In /dspace/webapps/xmlui/WEB-INF/web.xml, replace:

  <filter>
    <filter-name>SetCharacterEncoding</filter-name>
    <filter-class>org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter</filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>
  </filter>

with

  <filter>
    <filter-name>SetCharacterEncoding</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
      <param-name>forceEncoding</param-name>
      <param-value>true</param-value>
    </init-param>
  </filter>


Answering Mark's questions:

> I'm thinking that our filter as written could never have done what you
> expect, and the effect was produced elsewhere.  Our filter only sets
> the request's encoding.  Spring's filter is documented to also set the
> response's encoding when forceEncoding=true.  Perhaps BitstreamReader
> should just set the encoding on the response?

It seems that Spring's filter not only forces the encoding for text/html
but also converts the file. Compare the results with the default
DSpace web.xml (web.xml.old.data and web.xml.old.head) and with the
web.xml modified as described above (web.xml.new.data and web.xml.new.head):

root@dspace4-test:~# ls -l
total 52
-rw-r--r-- 1 root root 15140 Jan 13 15:06 web.xml.new.data
-rw-r--r-- 1 root root   384 Jan 13 15:06 web.xml.new.head
-rw-r--r-- 1 root root 24656 Jan 13 15:05 web.xml.old.data
-rw-r--r-- 1 root root   389 Jan 13 15:06 web.xml.old.head

The web.xml.*.data files were obtained by running "lynx --dump",
and the web.xml.*.head files by running "lynx --head --dump".


As you can see, the headers differ as one would expect:

root@dspace4-test:~# diff -u web.xml.old.head web.xml.new.head
--- web.xml.old.head    2014-01-13 15:06:00.036506000 -0500
+++ web.xml.new.head    2014-01-13 15:06:52.800506000 -0500
@@ -1,14 +1,14 @@
 HTTP/1.1 200 OK
 Server: Apache-Coyote/1.1
-Set-Cookie: JSESSIONID=818F618169946A0770D8DE6A572348E5; Path=/xmlui/; HttpOnly
+Set-Cookie: JSESSIONID=9A3ADC942740B4A31CF0AC971CD4BCBB; Path=/xmlui/; HttpOnly
 X-Cocoon-Version: 2.2.0
 Vary: User-Agent
 Last-Modified: Mon, 13 Jan 2014 19:50:43 GMT
-Expires: Mon, 13 Jan 2014 21:06:00 GMT
-Content-Type: text/html;charset=ISO-8859-1
+Expires: Mon, 13 Jan 2014 21:06:52 GMT
+Content-Type: text/html;charset=UTF-8
 Content-Language: en
 Content-Length: 18139
-Date: Mon, 13 Jan 2014 20:06:00 GMT
+Date: Mon, 13 Jan 2014 20:06:52 GMT
 Connection: close
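
To make the header difference concrete, here is a minimal Python sketch
(not part of the original test) of how a client that trusts the
Content-Type header would decode the same bytes under each declared
charset. The sample word and the helper name declared_charset are
assumptions made for illustration, as is the ISO-8859-1 fallback.

```python
from email.message import Message


def declared_charset(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value, lower-cased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Falling back to ISO-8859-1 is an assumption of this sketch.
    return msg.get_content_charset("iso-8859-1")


# The two header values from the diff above.
old_header = "text/html;charset=ISO-8859-1"
new_header = "text/html;charset=UTF-8"

# UTF-8 bytes of a Cyrillic sample word, standing in for the bitstream.
body = "Введение".encode("utf-8")

# Decode the identical bytes the way each header tells the client to.
as_old = body.decode(declared_charset(old_header))
as_new = body.decode(declared_charset(new_header))

print(repr(as_old))  # each Cyrillic letter turns into two Latin-1 characters
print(repr(as_new))  # the original word
```

The bytes are the same in both cases; only the label the client is told
to trust changes.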


But the data files also differ (compare the sizes in the listing above),
and so do their first bytes:

root@dspace4-test:~# cat web.xml.new.data | head -n 3 | hd
00000000  d0 92 d0 b2 d0 b5 d0 b4  d0 b5 d0 bd d0 b8 d0 b5  |................|
00000010  0a 0a d0 9f d1 80 d0 be  d0 b2 d0 b5 d1 80 d0 b5  |................|
00000020  d0 bd d0 be 3a 20 32 30  20 d0 bc d0 b0 d1 80 d1  |....: 20 .......|
00000030  82 d0 b0 20 31 39 34 38  20 d0 b3 d0 be d0 b4 d0  |... 1948 .......|
00000040  b0 0a                                             |..|
00000042

root@dspace4-test:~# cat web.xml.old.data | head -n 3 | hd
00000000  c3 90 c3 90 c2 b2 c3 90  c2 b5 c3 90 c2 b4 c3 90  |................|
00000010  c2 b5 c3 90 c2 bd c3 90  c2 b8 c3 90 c2 b5 0a 0a  |................|
00000020  c3 90 c3 91 c3 90 c2 be  c3 90 c2 b2 c3 90 c2 b5  |................|
00000030  c3 91 c3 90 c2 b5 c3 90  c2 bd c3 90 c2 be 3a 20  |..............: |
00000040  32 30 20 c3 90 c2 bc c3  90 c2 b0 c3 91 c3 91 c3  |20 .............|
00000050  90 c2 b0 20 31 39 34 38  20 c3 90 c2 b3 c3 90 c2  |... 1948 .......|
00000060  be c3 90 c2 b4 c3 90 c2  b0 0a                    |..........|
0000006a
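
A quick check of the two dumps, as a Python sketch: the bytes in
web.xml.old.data match what you get when the UTF-8 bytes in
web.xml.new.data are read as ISO-8859-1, the C1 control characters
(0x80-0x9F) are dropped, and the rest is written out again as UTF-8.
That lynx performs this re-encoding when dumping a page declared as
ISO-8859-1 is an assumption, not something verified here. If it holds,
the conversion may be happening in the client rather than in Spring's
filter, which would also fit the unchanged Content-Length in both
headers. The function names are hypothetical.

```python
# Hex bytes copied from the two "hd" dumps above.
NEW_HEX = """
d0 92 d0 b2 d0 b5 d0 b4  d0 b5 d0 bd d0 b8 d0 b5
0a 0a d0 9f d1 80 d0 be  d0 b2 d0 b5 d1 80 d0 b5
d0 bd d0 be 3a 20 32 30  20 d0 bc d0 b0 d1 80 d1
82 d0 b0 20 31 39 34 38  20 d0 b3 d0 be d0 b4 d0
b0 0a
"""
OLD_HEX = """
c3 90 c3 90 c2 b2 c3 90  c2 b5 c3 90 c2 b4 c3 90
c2 b5 c3 90 c2 bd c3 90  c2 b8 c3 90 c2 b5 0a 0a
c3 90 c3 91 c3 90 c2 be  c3 90 c2 b2 c3 90 c2 b5
c3 91 c3 90 c2 b5 c3 90  c2 bd c3 90 c2 be 3a 20
32 30 20 c3 90 c2 bc c3  90 c2 b0 c3 91 c3 91 c3
90 c2 b0 20 31 39 34 38  20 c3 90 c2 b3 c3 90 c2
be c3 90 c2 b4 c3 90 c2  b0 0a
"""


def from_hex(dump: str) -> bytes:
    """Turn a whitespace-separated hex dump into bytes."""
    return bytes.fromhex("".join(dump.split()))


def misread_as_latin1(utf8_bytes: bytes) -> bytes:
    """Read UTF-8 bytes as ISO-8859-1, drop C1 controls, re-encode as UTF-8."""
    text = utf8_bytes.decode("iso-8859-1")
    # Dropping U+0080..U+009F is the assumed client behaviour.
    kept = "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)
    return kept.encode("utf-8")


new_bytes = from_hex(NEW_HEX)
old_bytes = from_hex(OLD_HEX)

print(new_bytes.decode("utf-8"))                  # the readable Cyrillic text
print(misread_as_latin1(new_bytes) == old_bytes)  # do the dumps line up?
```

Fetching the bitstream with a tool that does not re-encode its output
and comparing the raw bytes under both web.xml variants would settle
which side does the conversion.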


Petya.

_______________________________________________
Dspace-devel mailing list
https://lists.sourceforge.net/lists/listinfo/dspace-devel