
[Dspace-tech] Scalability issues report, DSpace@Cambridge


[Dspace-tech] Scalability issues report, DSpace@Cambridge

Tom De Mulder
DSpace scalability issues report, per wiki template:

1. DSpace@Cambridge, The University of Cambridge, UK.
   Technical contacts: Tom De Mulder, [hidden email] (systems manager)
     Simon Brown [hidden email] (DSpace developer)

2. a. DSpace version 1.6.2 with extensive local patches, using JSPUI
      Size: 137 communities, 258 collections, >200k items, 12TB, 436k bitstreams (excluding licenses)

   b. PostgreSQL 8.4.4

   c. Tomcat 6.0.24 standalone

   d. Separate servers for webapp, DB, storage and ancillary functions
      Webapp/DB servers are HT 8-core Intel servers running Ubuntu Linux
      with 16GB of memory and fast local storage
      Java memory: -Xmx2048M -Xms2048M

3. a. - Unless Tomcat is restarted, it will consistently fail due to lack of memory in less than 48 hours.
      - Batch importer: will fail on large batch imports (order of thousands of items), performance degrades with size of repository and of batch.
      - Search indexer: fails on large repositories, slowing down and eventually running out of memory.
      - Assetstore: random structure causes large overhead on filesystem for no real gain
     
      See also our poster, presented in Gothenburg: http://tools.dspace.cam.ac.uk/DSUG09%20A2%20poster.pdf

   b. Installed vanilla DSpace 1.6.2, imported 200k randomly generated items, ran siege against it, watched it not cope.
      We've done profiling in the past, but not for 1.6. However, we've not noticed significant changes in the code that has issues.

   c. We have patches for the indexer; batch importer; thumbnail and PDF text extraction; assetstore structure; dark item masking in OAI and browse code

4. We can't commit to volunteering unless this can be incorporated into the work we need to undertake in our primary capacity of running the University's Institutional Repository. However, we would be willing to try and make this happen.


--
Tom De Mulder <[hidden email]> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH



Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Tom De Mulder
(Apologies for replying to my own email.)

One metric the template didn't ask for, I just noticed, is the number of hits per second.

We average about 2 hits per second, which is very low, even if most of these hits are actual page views, not just layout elements. However, both our webapp and database servers are under constant load, the latter in particular.

Actual load average numbers are meaningless for comparison because they depend so much on the way the OS kernel implements them, so I won't give them. Suffice it to say, though, that we had to ask the people running our university search engine and similar services to throttle their indexing rate so the servers wouldn't get overloaded.

Also of note is that the problems are mostly on the database and webapp end; there are no problems with I/O (disk or network).


--
Tom De Mulder <[hidden email]> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH



Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Tim Donohue
In reply to this post by Tom De Mulder
Hi Tom,

First off, thanks for sharing your details, and some more specifics
about the memory issues you are seeing.  Much appreciated.

I've got a few followup questions, if you or Simon don't mind answering
them.  Just trying to get a better understanding of where the core
problem(s) reside, so that we can make useful suggestions to you (and
find ways to resolve issues in DSpace itself).  I've also noted a few
areas where you might be able to get some temporary relief, at least
until we can fix any underlying issues in DSpace.

On 10/7/2010 5:32 AM, Tom De Mulder wrote:

> DSpace scalability issues report, per wiki template:
>
> 1. DSpace@Cambridge, The University of Cambridge, UK.
>     Technical contacts: Tom De Mulder, [hidden email] (systems manager)
>       Simon Brown [hidden email] (DSpace developer)
>
> 2. a. DSpace version 1.6.2 with extensive local patches, using JSPUI
>        Size: 137 communities, 258 collections, >200k items, 12TB, 436k bitstreams (excluding licenses)
>
>     b. PostgreSQL 8.4.4

In order to better understand your PostgreSQL configs, would you be
willing to share how your "work_mem" and "shared_buffers" are
configured?  Or, if you could share the whole Postgres config, that
could also help.

I just know there are sometimes ways to performance tune PostgreSQL for
larger sized databases (which yours surely is, based on the amount of
content).  If you've already investigated PostgreSQL performance tuning,
it'd be good to know as well.  Here's some very basic info from our wiki
(and much more off the PostgreSQL site as well), which shows settings
we'd be looking to tune for potentially better DB performance:

https://wiki.duraspace.org/display/DSPACE/PostgresPerformanceTuning
http://wiki.postgresql.org/wiki/Performance_Optimization
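
For what it's worth, on a dedicated 16GB database server the usual
starting points look something like the settings below. These values
are purely illustrative (not taken from your config, which we haven't
seen); the right numbers depend on your workload:

    # postgresql.conf sketch for a dedicated ~16GB DB server (assumed values)
    shared_buffers = 4GB           # often ~25% of RAM on a dedicated box
    effective_cache_size = 12GB    # planner hint: shared_buffers + OS cache
    work_mem = 32MB                # per sort/hash; multiplied by concurrent queries
    maintenance_work_mem = 512MB   # speeds up VACUUM and index builds

The stock defaults (e.g. shared_buffers = 32MB) are far too small for a
database your size, which is why those are usually the first settings
to check.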

>
>     c. Tomcat 6.0.24 standalone
>
>     d. Separate servers for webapp, DB, storage and ancillary functions
>        Webapp/DB servers are HT 8-core Intel servers running Ubuntu Linux
>        with 16GB of memory and fast local storage
>        Java memory: -Xmx2048M -Xms2048M

Obviously, throwing more memory at this issue may not be a long-term
solution.  But you could try (even temporarily) giving Java more than
the current 2GB, to see whether that lessens the frequency of these
memory errors (they would likely still occur at some point, though).
Whether this is even possible as a temporary fix depends on whether the
rest of that 16GB of memory is already allocated to other applications.

> 3. a. - Unless Tomcat is restarted, it will consistently fail due to lack of memory in less than 48 hours.
>        - Batch importer: will fail on large batch imports (order of thousands of items), performance degrades with size of repository and of batch.
>        - Search indexer: fails on large repositories, slowing down and eventually running out of memory.
>        - Assetstore: random structure causes large overhead on filesystem for no real gain
>
>        See also our poster, presented in Gothenburg: http://tools.dspace.cam.ac.uk/DSUG09%20A2%20poster.pdf

Could you actually send us a few of the out of memory error messages you
are seeing?

You may be seeing out of memory errors either from PermGen
("OutOfMemoryError: PermGen space") or from Heap ("OutOfMemoryError:
Java heap space"). So, it'd be good to determine which type(s) of memory
error it is.  We might be able to help you tweak your Java settings to
at least temporarily avoid these errors being thrown. (Again, likely not
a permanent fix, but may help temporarily until we can fix any memory
usage issues in DSpace.)

Also, knowing the type(s) of memory error may be able to help us
determine what the cause in DSpace may be.
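
As a concrete (and purely hypothetical) example, the sort of thing we
might suggest adding to Tomcat's bin/setenv.sh, depending on which type
of error you are seeing:

    # Sketch only - the values are guesses until we see the actual errors.
    # MaxPermSize only helps if the error is "PermGen space"; the heap-dump
    # flags capture evidence either way.
    CATALINA_OPTS="-Xms2048M -Xmx2048M -XX:MaxPermSize=256M \
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp"

A heap dump taken at the moment of failure, opened in a tool like
Eclipse MAT or jhat, would show exactly which DSpace objects are piling
up.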

>
>     b. Installed vanilla DSpace 1.6.2, imported 200k randomly generated items, ran siege against it, watched it not cope.
>        We've done profiling in the past, but not for 1.6. However, we've not noticed significant changes in the code that has issues.

It'd be good to get an idea of what wasn't coping well in the vanilla
DSpace 1.6.2 tests you ran.

For instance, did accessing DSpace suddenly become extremely slow (e.g.
browsing/searching), or was it the import script that slowed down (or
maybe both)?  I guess the question is: what did you have the SIEGE
program (http://www.joedog.org/index/siege-home) testing?  Was it just
making random requests to the website and recording very poor
performance, or was it receiving out-of-memory errors?

If you could, it would also be wonderful if you would share your
scripts for randomly generating 200K DSpace items. This would allow us
to do the same testing locally and replicate exactly what you have seen.
Plus, this seems like it could be a great testing tool for the whole
community (we are always looking for a decent set of test data). Sure,
we could probably rewrite the code from scratch, but it's always better
to just grab a copy of existing code if you can :)
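
(In case it helps anyone reading along: a script of roughly the shape
below - my guess at how such a generator might work, not a copy of the
Cambridge one - can produce items in the Simple Archive Format that the
batch importer consumes.

    #!/bin/sh
    # Hypothetical sketch: generate N throwaway items in DSpace's
    # Simple Archive Format, for batch-import load testing.
    N=${1:-1000}
    OUT=${2:-./test-archive}
    i=0
    while [ "$i" -lt "$N" ]; do
      dir="$OUT/item_$i"
      mkdir -p "$dir"
      # Minimal metadata record for one item.
      printf '%s\n' \
        '<dublin_core>' \
        "  <dcvalue element=\"title\" qualifier=\"none\">Random test item $i</dcvalue>" \
        '  <dcvalue element="contributor" qualifier="author">Tester, Random</dcvalue>' \
        '  <dcvalue element="date" qualifier="issued">2010-01-01</dcvalue>' \
        '</dublin_core>' > "$dir/dublin_core.xml"
      # One small bitstream per item, listed in the 'contents' manifest.
      echo "random payload $i" > "$dir/file_$i.txt"
      echo "file_$i.txt" > "$dir/contents"
      i=$((i+1))
    done

You would then run a normal batch import over the output (e.g.
'[dspace]/bin/dspace import -a ...') and point siege at the resulting
site.)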

>     c. We have patches for the indexer; batch importer; thumbnail and PDF text extraction; assetstore structure; dark item masking in OAI and browse code

I'd already mentioned this before, and there's not a huge rush. But,
when you find some time, it'd be good to add some/all of these patches
to JIRA (especially those you feel are applicable to 1.6.x).  That way
we can get more eyes on them, and do a full review and (hopefully, if
all works out in our 1.7 timeline) maybe even get some/many into 1.7.0.

For more on the DSpace 1.7 timeline, see this wiki page
https://wiki.duraspace.org/display/DSPACE/DSpace+Release+1.7.0+Notes

> 4. We can't commit to volunteering unless this can be incorporated into the work we need to undertake in our primary capacity of running the University's Institutional Repository. However, we would be willing to try and make this happen.

No problem & definitely understood.

Providing this sort of performance feedback is already much appreciated.
If you are able to share your patches, some more specifics on the
OutOfMemoryErrors, and potentially your 'randomly generated items'
script, that'd be a huge boost for us.  It'd definitely help us get a
better sense of exactly what is happening (and where) in DSpace, so
that we can hopefully get some immediate fixes ready in time for the
1.7.0 release.

Thanks!

- Tim


Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Stuart Lewis
In reply to this post by Tom De Mulder
Hi Tom,

Thanks for this extra level of information - it will really help.

A few random questions come to mind:

>   d. Separate servers for webapp, DB, storage and ancillary functions
>      Webapp/DB servers are HT 8-core Intel servers running Ubuntu Linux
>      with 16GB of memory and fast local storage
>      Java memory: -Xmx2048M -Xms2048M

Is there a reason why you only allocate 1/8th of the system memory to the application?  Have you found that adding extra doesn't help?


>  - Assetstore: random structure causes large overhead on filesystem for no real gain

Are you able to expand on the overhead that is caused, and from your profiling, explain how the structure could be improved?  My gut (and uninformed) instinct would be that since asset store reads are completely random depending on the items being viewed at the time, the layout of directories would be irrelevant.  Writes may be slightly less efficient, but since writes only tend to occur once, they are of less consequence.


> - Search indexer: fails on large repositories, slowing down and eventually running out of memory.

Do you have any percentages on the proportion of page views that relate to browse, and how many relate to other views?  I'm curious if browse from the front end is causing an issue too.  The reason I'm asking is that with the potential inclusion of the dspace-discovery layer in a future version, this could replace the database-driven browse system with solr.  Not only will this provide a richer faceted search, but it could likely offer a good performance boost for browse-related functions.  It also offers another way of scaling out, by putting solr on a different server.


> 4. We can't commit to volunteering unless this can be incorporated into the work we need to undertake in our primary capacity of running the University's Institutional Repository. However, we would be willing to try and make this happen.

That would be great if you could, and we'd all really appreciate your input.  Users of the software such as yourselves, or BioMed Central's Open Repository, are pushing the software in ways that 95% of installations don't (yet).  The only way we can effectively push past these scalability boundaries is with the active participation of those who are encountering the problems.  One of the joys of working in an open source environment is that we have the structures and processes that enable this, and it is great to watch the results when everyone pitches in together to improve the software for us all.

Cheers,


Stuart Lewis
IT Innovations Analyst and Developer
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928



Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Tom De Mulder
On 7 Oct 2010, at 21:56, Stuart Lewis wrote:

>>     with 16GB of memory and fast local storage
>>     Java memory: -Xmx2048M -Xms2048M
> Is there a reason why you only allocate 1/8th of the system memory to the application?  Have you found that adding extra doesn't help?

In our experience, it merely delays when the error occurs, and we'd still need to restart. Whether we do this nightly or every other night doesn't make much difference. I'm not sure it would actually make it go faster. Additionally, we need to keep memory free for file caching and thumbnail generation; we found that if we assign too much memory to Java then the system needs to read from disk more for these other tasks and we get a slow-down there.

>> - Assetstore: random structure causes large overhead on filesystem for no real gain
> Are you able to expand on the overhead that is caused, and from your profiling, explain how the structure could be improved?  My gut (and uninformed) instinct would be that since asset store reads are completely random depending on the items being viewed at the time, the layout of directories would be irrelevant.  Writes may be slightly less efficient, but since writes only tend to occur once, they are of less consequence.

Apologies for sounding cryptic; I was trying not to be too verbose in the template. :-)

This has mostly to do with back-ups. With about 600,000 files in random directories, it can be hard to find out what files have changed. We implemented a simple asset store structure that stores files by year/month/day. This means we can mirror new files very quickly, and only traverse the entire assetstore every other day to check if files have changed.
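
For illustration, the idea is simply this (made-up paths and variable
names, not our actual patch):

    # Assumed layout: assetstore/YYYY/MM/DD/<internal_id>
    today=$(date +%Y/%m/%d)
    mkdir -p "$ASSETSTORE/$today"
    cp "$incoming_file" "$ASSETSTORE/$today/$internal_id"

That way "everything written today" is a single small directory, rather
than being scattered across the whole tree.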

Maybe I should expand a bit on our storage set-up:

- our live system has about 90TB capacity, with an EMC SAN connected to a pair of Sun servers. These present the storage to our private network at about 4Gbps, and also run the checksums (I wrote some Perl to do this job locally, rather than add to the I/O of the live server).

- we have two sets of back-up servers (ZFS-based) off-site for the live system, which use rsync to mirror all this data. (Two systems because otherwise, if we lost one, the data would be vulnerable for too long while it re-synced.)

A small script makes copies of the day's assetstore every hour; a complete rsync runs across the assetstores (the original one as well as the new one with our own datestamp format) on alternating days, and at weekends we run rsync with checksums. Essentially this system is copy-on-write: if a file changes on disk, the old back-up copy is moved into a holding area to be deleted when necessary, and the new file is copied into its place.
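
In outline, the rotation looks something like this (a sketch with
made-up paths, not our actual scripts):

    # Hourly: mirror only today's directory - cheap, thanks to the
    # date-based layout.
    day=$(date +%Y/%m/%d)
    rsync -a "/assetstore/$day/" "backup:/assetstore/$day/"

    # Alternating days: walk the full tree; rsync moves the old copy of
    # any changed file into a holding area (the copy-on-write behaviour).
    rsync -a --backup --backup-dir="/holding/$(date +%F)" \
          /assetstore/ backup:/assetstore/

    # Weekends: the same, but verify file contents with checksums (-c).
    rsync -ac --backup --backup-dir="/holding/$(date +%F)" \
          /assetstore/ backup:/assetstore/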

Finally, the date structure for the directory/file names helps locate problem files quickly if necessary. Not a huge thing, but it makes my life easier.

>> - Search indexer: fails on large repositories, slowing down and eventually running out of memory.
> Do you have any percentages on the proportion of page views that relate to browse, and how many relate to other views?  I'm curious if browse from the front end is causing an issue too.  The reason I'm asking is that with the potential inclusion of the dspace-discovery layer in a future version, this could replace the database-driven browse system with solr.  Not only will this provide a richer faceted search, but it could likely offer a good performance boost for browse-related functions.  It also offers another way of scaling out, by putting solr on a different server.

This question I'll have to leave to Simon to answer, so I don't make a hash of it.


Best,

--
Tom De Mulder <[hidden email]> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH



Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Hilton Gibson


On 08/10/2010 11:13, Tom De Mulder wrote:

>>> - Assetstore: random structure causes large overhead on filesystem for no real gain
>> Are you able to expand on the overhead that is caused, and from your profiling, explain how the structure could be improved?  My gut (and uninformed) instinct would be that since asset store reads are completely random depending on the items being viewed at the time, the layout of directories would be irrelevant.  Writes may be slightly less efficient, but since writes only tend to occur once, they are of less consequence.
> Apologies for sounding cryptic; I was trying not to be too verbose in the template. :-)
>
> This has mostly to do with back-ups. With about 600,000 files in random directories, it can be hard to find out what files have changed. We implemented a simple asset store structure that stores files by year/month/day. This means we can mirror new files very quickly, and only traverse the entire assetstore every other day to check if files have changed.

See: http://hdl.handle.net/10019.1/3161
How strange, I also proposed such a thing!

--
Hilton Gibson
Systems Administrator
JS Gericke Library
Room 1053
Stellenbosch University
Private Bag X5036
Stellenbosch
7599
South Africa

Tel: +27 21 808 4100 | Cell: +27 84 646 4758



Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Tom De Mulder
In reply to this post by Stuart Lewis
Dear all,

I'm attaching a dump of our PostgreSQL configuration to this email. We got some input from Postgres developers on how best to tune it for our needs, but if someone has suggestions for things to try then we'd be happy to hear them.


Best regards,

--
Tom De Mulder <[hidden email]> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH



[Attachment: uk.ac.cam postgresql_settings.txt (26K)]

Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Stuart Lewis
In reply to this post by Hilton Gibson
Hi Hilton,

>>>> - Assetstore: random structure causes large overhead on filesystem for no real gain
>>> Are you able to expand on the overhead that is caused, and from your profiling, explain how the structure could be improved?  My gut (and uninformed) instinct would be that since asset store reads are completely random depending on the items being viewed at the time, the layout of directories would be irrelevant.  Writes may be slightly less efficient, but since writes only tend to occur once, they are of less consequence.
>> Apologies for sounding cryptic; I was trying not to be too verbose in the template. :-)
>>
>> This has mostly to do with back-ups. With about 600,000 files in random directories, it can be hard to find out what files have changed. We implemented a simple asset store structure that stores files by year/month/day. This means we can mirror new files very quickly, and only traverse the entire assetstore every other day to check if files have changed.
>
> See: http://hdl.handle.net/10019.1/3161
> How strange, I also proposed such a thing !!

I've just read this paper and have a question.  You state the following:

----
At the moment, December 2009, the following two are the most widely used software packages for building and maintaining institutional repositories according the opendoar website.

http://www.dspace.org with 502 installations.
http://www.eprints.org with 261 installations.

The digital objects and store are located as follows for the above:

• DSpace => $DSPACE_HOME/assetstore
• EPrints => $EPRINTS_HOME/disk0

None of the above use a time/date based file system for storing digital objects. None of them use UUID's to create unique digital
objects and stores.

In one hundred years time how can any of the above satisfy a future researcher that the digital object is unique and has remained persistently so during the years to 2109.
----

Are you able to expand for us on your reasoning that repositories that do not use datestamped directories and filenames containing UUIDs will not satisfy future researchers?

Just because a file is stored in that location with a UUID makes it no more or less likely that it has remained unique and persistent.  Filenames alone cannot guarantee this - it is up to the repository to manage the integrity of the stored items, and up to the wider system to ensure that this is the case. This is where the notion of a 'trusted repository' comes into play - the fact that the repository platform, and the system as a whole, is trusted to have maintained the integrity of the contents.

[A side note: You'll find a lot of the work that Tim has been leading recently regarding AIPs is of interest in this area. https://wiki.duraspace.org/display/DSPACE/AipBackupRestore ]

Cheers,


Stuart Lewis
IT Innovations Analyst and Developer
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928



Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Hilton Gibson
Hi Stuart

If you read further in the paper:
>>>>
6. Creating authentic digital objects
Now that we have uniquely persistent digital objects the next step would have been to ascertain their authenticity. Using a
system of digital signatures and time stamping verification this may have been possible.
However, during the last three years it seems there has been a decline in confidence in digital signatures and time stamping in
general. The complexity of the system also seems to have impacted on the decline in usage. The concept of authentic digital objects
has been briefly addressed by:
>>>>

Thanks for taking the time to read it.
Dr Hussein Suleman at UCT can help you further.
See: http://www.husseinsspace.com

Cheers

hg

-- 
Hilton Gibson
Systems Administrator
JS Gericke Library
Room 1053
Stellenbosch University
Private Bag X5036
Stellenbosch
7599 
South Africa

Tel: +27 21 808 4100 | Cell: +27 84 646 4758

"Simplicity is the ultimate sophistication"
	Leonardo da Vinci


Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Hilton Gibson
In reply to this post by Stuart Lewis
Hi Stuart

Also please see: http://wiki.lib.sun.ac.za/index.php/SUNScholar/Digital_Signing

Cheers

hg

-- 
Hilton Gibson
Systems Administrator
JS Gericke Library
Room 1053
Stellenbosch University
Private Bag X5036
Stellenbosch
7599 
South Africa

Tel: +27 21 808 4100 | Cell: +27 84 646 4758

"Simplicity is the ultimate sophistication"
	Leonardo da Vinci


Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Stuart Lewis
In reply to this post by Tom De Mulder
Hi Tom,

Thanks again for your answers - apologies for following these up with more questions...

>>>    with 16GB of memory and fast local storage
>>>    Java memory: -Xmx2048M -Xms2048M
>> Is there a reason why you only allocate 1/8th of the system memory to the application?  Have you found that adding extra doesn't help?
>
> In our experience, it merely delays when the error occurs, and we'd still need to restart. Whether we do this nightly or every other night doesn't make much difference. I'm not sure it would actually make it go faster. Additionally, we need to keep memory free for file caching and thumbnail generation; we found that if we assign too much memory to Java then the system needs to read from disk more for these other tasks and we get a slow-down there.

Is this a linear relationship between memory and time, or does it start to flatten out over time?


>>> - Assetstore: random structure causes large overhead on filesystem for no real gain
>> Are you able to expand on the overhead that is caused, and from your profiling, explain how the structure could be improved?  My gut (and uninformed) instinct would be that since asset store reads are completely random depending on the items being viewed at the time, the layout of directories would be irrelevant.  Writes may be slightly less efficient, but since writes only tend to occur once, they are of less consequence.
>
> Apologies for sounding cryptic; I was trying not to be too verbose in the template. :-)
>
> This has mostly to do with back-ups. With about 600,000 files in random directories, it can be hard to find out what files have changed. We implemented a simple asset store structure that stores files by year/month/day. This means we can mirror new files very quickly, and only traverse the entire assetstore every other day to check if files have changed.
>
> Maybe I should expand a bit on our storage set-up:
>
> - our live system has about 90TB capacity, with an EMC SAN connected to a pair of Sun servers. These present the storage to our private network at about 4Gbps, and also run the checksums (I wrote some Perl to do this job locally, rather than add to the I/O of the live server).
>
> - we have two sets of back-up servers (ZFS-based) off-site for the live system, which use rsync to mirror all this data. (Two systems because otherwise, if we lost one, the data would be vulnerable for too long while it re-synced.)
>
> A small script makes copies of the day's assetstore every hour; a complete rsync runs across the assetstores (the original one as well as the new one with our own datestamp format) on alternating days, and at weekends we run rsync with checksums. Essentially this system is copy-on-write: if a file changes on disk, the old back-up copy is moved into a holding area to be deleted when necessary, and the new file is copied into its place.

My initial concern with that setup would be the use of rsync over such a large amount of storage: rsync is horrendous for processor consumption, and you have a lot of disk for rsync to chew through in order to detect changed files.  Is there a reason you don't use the in-built ZFS replication facility?  This will presumably be much more efficient, as the filesystem itself implicitly knows when to perform replication, and will be quicker and more up-to-date than hourly syncs.
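
(For anyone unfamiliar with it, snapshot-based ZFS replication looks
roughly like this; dataset and host names are invented for the example:

    # Take a snapshot, send it in full once, then send only the deltas.
    zfs snapshot tank/assetstore@2010-10-08
    zfs send tank/assetstore@2010-10-08 | ssh backup zfs recv -F tank/assetstore
    # Next day: incremental send of only the blocks changed since the
    # previous snapshot.
    zfs snapshot tank/assetstore@2010-10-09
    zfs send -i tank/assetstore@2010-10-08 tank/assetstore@2010-10-09 \
      | ssh backup zfs recv tank/assetstore

Because the filesystem already knows which blocks have changed, there is
no tree walk at all.)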

Cheers,


Stuart Lewis
IT Innovations Analyst and Developer
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928

