We are back for day two of Repository Fringe 2014!
We start the day with two parallel sessions. In Appleton Tower (M2ABC) we have Muriel Mewissen of EDINA, speaking about the Jisc Publications Router – Delivering Open Access Content to Institutions. In Informatics – and here on the blog – we have:
Unwrapping Digital Preservation, Steph Taylor, ULCC, University of London, Informatics Forum, G.07
Firstly a bit about why I am here, and why I’m talking about preservation. At ULCC I work with my colleague Ed providing training on digital preservation, but before that I was working with repositories.
The first time I heard of repositories as an archive was in the early days of RSP. A lot of repositories used the word “archive” or similar terms, but even then, as an archivist, I was concerned about the use of that word. Repositories can be a scary word, but these were not spaces doing preservation. As time moved on there was a perception that depositing material would, inevitably, mean it was being preserved. As funders really started backing deposit of papers, the job of actually planning and conducting preservation increasingly fell on repository managers. I had quite a lot of phone calls as those changes came in, and some of those issues of what preservation really means are what I want to talk about.
A repository isn’t a digital archive. It may be, but it isn’t necessarily that. But why do we have this idea of those terms being synonymous somehow? Well the definition of a repository is about depositing things especially “for storage or safe keeping” BUT there is no mention of the long term. That is the difference from an archive, from the practice of preservation. There are some things you might not have. You might not have a preservation plan – you need to define the long term and plan for what is needed to ensure the item is preserved. You need some selected preservation strategies – even with paper archives you need to check materials are safe, conduct some materials. And we need to have preservation strategies for digital materials – we need to ensure that formats are still readable, that files are intact, there are lots of interventions that may be required. At a previous library I worked at, when theses were deposited, we committed to preserve them. We had some theses already sitting with unreadable large form floppy discs – they were obsolete very quickly and we didn’t have the strategy to ensure that was preserved. So strategies could be emulation, switching format, etc. But you need strategies.
You may also not have an archival-quality digital object. It may be about the highest quality file – a much bigger file. But that’s something repositories do not really do. A version designed for the long term. In an archive you might preserve a TIFF file, but also produce a JPEG to enable distribution and access by others. And that can provide real flexibility. University of York have been able to preserve TIFFs for a gallery and, with their permission, make those available to view online.
For preservation you need preservation metadata – every time you access, change, view the item you need to record it. And you need technical metadata constructed with preservation in mind – everything you can gather in order to enable changes in format, ways to open the file, to understand software and hardware shifts.
And I want repository managers to think about access and rights management for the long term. Typically we only think about embargoes as the longest terms. But if we are preserving digital content we need to think 10, 20, more years into the future, to future people in our role. And we always need the permission of the rights owner to preserve and to make copies for preservation of an item. The law has shifted a little but you still need written permission. And copies for preservation are very different to rights to share publicly. But all of these are different to what we are used to doing with our repositories.
You need to do regular, planned checks to see if content is still accessible and can “play” ok. Most digital preservation centres check regularly – every 12 months say – but also at any trigger points like changes in hardware, changes or discontinuations of software. Those checks are essential. Formats that are easy to use and deposit aren’t enough for digital preservation. And the idea of content being accessible over time… we need procedures in place to handle problems of content becoming inaccessible over time. And we have to know what embargoes apply around data protection, which sections require removal for privacy or security reasons, etc. These are very different to journals, publishers, and copyright issues.
There is a lot here. But…
A repository CAN expand to become a digital archive as well. Or maybe you have another system that you pass items onto. But the system can be expanded, can be reshaped, can be turned into a digital archive. And if you do want to do that, start by thinking about what you want to keep, for how long, and why. There is more information at DPC, Open Planets, DCC – all of whom provide huge amounts of information and support around digital preservation.
The other big difference between a repository and a digital archive is around selection. You will have some sort of selection process for your repository – may be about who someone is, what they are doing, which project – may be many triggers. But for an archive do you need to keep everything? What is your further selection process? So, for example, if you have many iterations of the same article over time, do you want to keep all of them or just the published (or another) version? Digital preservation can be expensive in terms of kit, in terms of people… it is about the long term and that means long term. So you may want another selection process for preservation. And if you do want to create a digital preservation policy, or use your repository as a preservation archive, I’d recommend talking to colleagues already working in preservation as they will have policies and experience to draw upon.
And if you do want to get into this do look at training courses – we have an interactive one in London (I teach on it so may be biased!), but also you’ll find training courses and information from those organisations like DCC or DPC.
Comment 1: I am OpenDOAR – I sit on the computer looking at everyone’s repository to see if it’s good enough to go on OpenDOAR and looking for policies. Many have metadata etc. policies, but very few have a preservation policy. It is crucial. If you don’t have one already DO create one, do make it available, and contact us/add it to your OpenDOAR record.
Steph: Do people have policies in their repositories on preservation?
Comment 2: I’m in the lucky position of being repository manager and university archivist. My initial intention was to set up the repository as a digital archive. For the repository we have a check list of things to agree to… But in terms of the policy… I have it but how do I get it out there, into the university? Things like every time you have a new image on your computer, a new piece of equipment… letting me know so I understand what is needed.
Steph: You are not the first person I’ve seen with that joint role. But yes, it’s hard to get people to tell you this stuff. If you have senior buy in it is easier to get policies out, to mandate that information. Does anyone have a repository they’d like to use as an archive but are not doing that yet?
Quite a lot of hands up
Steph: We are increasingly seeing repository managers on our courses. And if your organisation isn’t yet engaging in digital preservation, repositories are a good place to start – there is a body of work there, it is a great place to get started. I also wanted to ask whether you accept file formats knowing that you can make archival copies and maintain them? My own experience was that getting full text in was hard enough… let alone worrying about how it was being sent.
Comment: For the repository that we had in the past you could send anything in. But for the RDM policy there is a defined set of file formats that are encouraged, another set that are acceptable, and another set of formats that are not acceptable.
Steph: That’s really good. It’s tempting to be as flexible as possible, and to take anything in. I’d recommend looking at file formats and seeing what’s good in the area you work in, and then making some choices. A very prominent organisation working in preservation didn’t do this themselves… they took a large amount of data in formats that are hard to take copies from, to maintain over time, they are a bit stuck with it. It’s well worth thinking about what you do and don’t take. Discuss with users, make it workable, write it down, and make it policy. Send back formats that are not preservable – ask them to convert or change the format yourself with their agreement.
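As an illustration of the three-tier approach described above (encouraged, acceptable, not acceptable), here is a minimal sketch of how a deposit workflow might check an incoming file against a written format policy. The tiers, the extensions in them and the function name `classify_deposit` are all hypothetical; a real policy would come out of the discussion with users that Steph recommends.

```python
from pathlib import PurePath

# Hypothetical tiers, for illustration only. A real list would be agreed
# with depositors and written into the repository's policy.
FORMAT_POLICY = {
    "encouraged": {".pdf", ".csv", ".txt", ".tif", ".tiff", ".xml"},
    "acceptable": {".docx", ".xlsx", ".jpg", ".jpeg"},
}


def classify_deposit(filename: str) -> str:
    """Return the policy tier for a file, judged by its extension only."""
    suffix = PurePath(filename).suffix.lower()
    for tier, extensions in FORMAT_POLICY.items():
        if suffix in extensions:
            return tier
    # Anything not listed (including files with no extension) is sent back
    # to the depositor for conversion.
    return "not acceptable"
```

Checking the extension alone is the simplest possible test; it says nothing about whether the file really is in that format, which is a separate check.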
Comment: The Library of Congress released file formats in ranked order of which were most preservable. A really good resource to look at.
Comment: And the way to sell to academics is that if you submit in a more common file format, their work will also be readable and accessible to many more people and that’s really important to them.
Steph: Absolutely. More people accessing their work, and for longer, is a huge motivator. But also do explain why this is a preservation challenge, why a format isn’t workable. Don’t take stuff you can’t manage. Talking of which, the final thing I wanted you to think about is whether you know what you are going to keep for the long term? Whatever the long term means to you – some funders specify how long they want materials kept for, some are vague. And do you want to keep everything, forever? I started with a quite domestic idea, to clutch everything tight… but you have to be much more selective to make preservation work. Do you have a selection policy for preservation yet? Or policies on how long things stay in your repository.
Comment: I don’t have policies for that. But much of what we store in a repository is material that is certainly conceptually, if not legally, the property of the academics. I want to engage with them to select materials. They may want to keep everything but there needs to be a mature conversation about that.
Steph: Everyone’s kneejerk reaction is everything for ever, you want your hard work saved… but it may depend on the work they do. Research data may not be useful beyond a certain point for instance. It’s important to engage with users, and to get the institutional view on what should be saved.
Comment: We have an unofficial policy that, as long as people can support it, we will preserve what’s in the repository. But for RDM we have at least 10 years from deposit and at least 10 years from the last point of access. That may be an SRC thing but it’s certainly a Lancaster thing. But it’s hard to sell to different people. Researchers love it. The information services and infrastructure people see it as a huge cost, they aren’t happy about it. And the library staff don’t have time to maintain everything there. So we need a selection policy… but a very big difference. Senior decision makers think that we have a repository, keep everything for ever.
Comment, Kevin Ashley: For those struggling with selection policy… the DPC, working with colleagues at the Australian National Data Centre, have released guidance on that and we are working on policies. We know that our guidelines have been used to write selection policies across the world. But ironically, as we understand this stuff more and more, we will need to throw away more and more stuff. And we will increasingly make wrong decisions – there is so much data – and you just have to live with it. Nothing is that bad.
Steph: Absolutely. Material will become inaccessible over time sometimes but making your best efforts to preserve at least means we don’t automatically lose these items.
Comment: To what extent should every repository expand to be a digital preservation space? Or to what extent should digital preservation be collaborative, something service based?
Steph: On a practical and technical side, yes, everyone could use their repository as a preservation space. But there may be reasons not to do that. Organisations pooling resources for preservation can be much more sustainable than individual approaches. It’s a good idea. It’s the kind of thing that lends itself to collaboration. May be lots of talk with users, with other organisations… but it makes sense financially, in terms of number of copies required. Collaboration can be hard, it can be challenging to get the right people at every level engaged… but there are a lot of benefits in it. That would be a good way forward.
Comment: As a follow up… collaborations can come about in different ways… sometimes through funding stimulus, sometimes by locality, sometimes by chance… something that would be useful would be to find ways to initiate collaborations, to get things off the ground. Many think collaboration is a good idea, but they don’t always know how to go about it.
Steph: There is a great set of communities around digital preservation. We don’t have a central body in that way although there is a great deal of work from DCC, many members at DPC. DPC actually now offers a service to facilitate consultancy from one member to another. Sometimes collaborations are geographical, thematic, many criteria… It would be great to have a sort of dating service for digital preservation – to find places to engage well with.
Comment, Steph’s colleague Tim: One of your key messages was that the repository is not a digital archive. But at ULCC we have been integrating EPrints with Archivein?. We have some information on that work over on the registration desk about how we
Comment: Our repository, DataShare, is a DSpace system. Our system does a report and checksum every day. You can check every day automatically that files aren’t corrupting in the database.
Steph: Absolutely. I didn’t want to go into too much detail but yes, if you go away with one thing, think about checksums. Really easy to automate checks that everything is ok.
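For anyone wondering what an automated checksum check looks like in practice, here is a minimal sketch. It is not how DSpace or DataShare does it; the function names and the choice of SHA-256 are assumptions for illustration. The idea is to record a checksum for each file at deposit, then recompute and compare on a schedule.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (run once, at deposit)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare current files with the stored manifest.

    Returns a mapping of relative path -> "missing" or "changed";
    an empty result means every recorded file is intact.
    """
    problems = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(target) != expected:
            problems[rel_path] = "changed"
    return problems
```

A daily job would load the stored manifest, call `check_fixity`, and report anything that comes back. The manifest itself needs to be kept somewhere safe, ideally separate from the files it describes.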
Comment: Do you have any advice on media. I know about archive quality CDs – you can’t get them everywhere and I know they have a limited life.
Steph: If my colleague Ed was here, he’d jump up and down at the notion of archive quality CDs…
Comment, Kevin Ashley: If you look at lifetime costs of those formats it can be huge, especially if they are more expensive to start with. Buying cheap, switching formats regularly, taking advantage of cheap technology is so much better than buying into expensive current tech and maintaining it. In exceptional cases there are times you need to do that, but it is rare. In the extreme case LOCKSS/CLOCKSS they use inexpensive systems but mass replication that allows for some or many instances to fail. That’s a very inexpensive way to do digital preservation.
Comment: Is it better to use in-house repository systems (already in place locally) with checksums, etc., or to outsource hosting and checking of archives?
Steph: There are pros and cons. With external companies, really check out the company: make sure it has a good standing and reputation, that you trust it, and that it has some sort of guarantee of what happens if things go wrong.
Following a coffee break (with Tunnocks, though none of them were dancing!), we now have a choice of sessions. In Appleton Tower (room 2ABC) there are two sessions from University of Edinburgh staff: at 10.45 we have Ianthe Sutherland on Collections.ed – Launching the University Collections Online; at 11.30 we have Angela Laurins and Dominic Tate running an Open Journal System Workshop. Here in Informatics we will be hearing a longer session on the Jisc Monitor Pilot Project.
Jisc Monitor Pilot Project: an exploration of how a Jisc managed shared service might support institutions in meeting the post-2014 REF Open Access policy, Brian Mitchell & Owen Stephens, Jisc, Informatics Forum, G.07
Brian: The origins are in the Jisc APC project which identified key challenges in the management of OA (see case studies online). So we wanted to build upon research outputs of Jisc APC, work on UK Policies relating to Open Access – HEFCE for instance. And we wanted to explore development of services to help universities monitor policies, including funder policies. Institutions had expressed a need for support and a role for Jisc in monitoring ALL publications – not just Gold and Green, in complying with funder mandates – such as licensing or embargoes, in monitoring spend, and to guide and share best practice.
So our outputs will be functioning prototypes mapped to 4 use cases and released as free and open source software by May 2014, robust user feedback, and a number of other components. Sitting at the heart of those use cases we have the Jisc Monitor use cases around:
- Monitoring all publication activity to ensure compliance with funder mandates
- Monitoring all publications activity to ensure a clear understanding of what has been published
- Standards development to enable efficient data exchange
- Monitoring spend on all items – also looking at invoicing and payment details and whether they can be standardised to interact with other systems such as finance systems.
The project benefits from a really collaborative range of participants and team members that spans experience. Collaboration is key to this project. We are taking a user centred approach to development – everything is shaped around the institutions. We are taking an Agile approach to allow us to be flexible as things change. And the work is open source.
- Use Case 1: Monitoring all publications – for funders and institutions.
- Use Case 2: Compliance – this is about bringing clarity to funder requirements and understanding what compliance means.
- Use Case 3 is around standards development and interoperability.
- Use Case 4 is about monitoring spend. Ideally we can provide invoice and payment details in a standardised and consistent way, for accurate capture and recording of information. And some transparency is required about standard OA charges.
This work has synergies with the Jisc Pathfinder projects, with RIOXX, the Publications Router, SHERPA, etc. You can find out much more on the Monitor blog, and there is a timeline on the Jisc Collections website. And we have a number of webinars and workshops coming up that you would be very welcome to be part of.
The next steps for us are around user consultation and requirements gathering. Follow on webinars will focus on Publications Activity and Funder Compliance. Prototypes from use cases will be available in September 2014. Systems interoperability and systems workshops will follow later in the year. Now over to Owen…
Owen: Now, you’ve signed up for a mammoth session here. I’m going to tell you about what we’ve done so far, how you can get involved, and then we want to get you discussing stuff in groups.
So, Brian has already described four strands of activities here. Tracking of publications; assurance of compliance; clarity of charging; interoperability across this information ecosystem. So, what’s our approach to this? We are working in 6 week periods producing working demonstrator software in that time. It’s a very rapid and intensive working period where we try to create working and meaningful stuff. As much as possible we are showing that stuff to you, to the community for feedback. So we are trying to get feedback from the community every two weeks based on wireframes or based on working software.
In the first 6 week period – which we are in now – we are looking at these two aspects: tracking of publication; and assurance of compliance. The interoperability stuff underlines/overarches this. But the APC type stuff will come in the second 6 week period.
We spent May-June 2014 planning our approach, having initial discussions about data models – what the data looks like and how it might fit together. And we have been synthesising existing input from the community into a Requirements Catalogue. That list of user stories comes from a lot of existing work in this area. We are aware that we are building on work that has been happening over many years, and we sit alongside many other projects. So we bring together stories from APC, from workshops, from Jisc, from Mimas, from a report on cost of ownership, a publishers workshop in January this year, and also work we’d done when we tendered initially. That can be found at: http://bit.ly/monitor-user-stories/. There are 135 stories, each rated either red, green or amber. Green are things we think Monitor could or should address. Amber could be approached but may not be solved. Red are things that are out of scope – we can’t or shouldn’t do anything about. And we put our own explanations against that.
In July we ran a face-to-face workshop (8th July) reviewing that draft set of requirements – looking for gaps, looking at whether the initial ranking was right. We had great feedback there. We hadn’t prioritised tracking Green OA, as opposed to Gold OA, and we hadn’t prioritised non-STEM stuff. And feedback was that non-STEM and Green OA were also high priority. And we then held our first online update on 23rd July, as part of ongoing process. There will be 3 more in next month and a half.
Some of what we reported at that online update included this diagram on the sources of data we have, how we can work with it. These include howopenisit.org (looks for textual descriptions of journal licenses from publisher websites); DOAJ and a database of (enhanced) Journal/article data from that, also working with PubMed via Lookup Article by Journal ISSN. So we are connecting up that combination of article, journal, journal license… We can dream up elegant technical solutions for which there is no data to support the process, we have to be careful of that. And we have to be careful of what data sources we look at. So we are also thinking about the Journal TOCs service (originally from Tic Tocs project), which would be a non STEM source for data. We might also look to Web of Science and Scopus – but let us know what other data sources we should be using if you see gaps there.
So, we’ve connected this stuff up, we’ve created a basic Jisc Monitor UI. It is a basic browse and search facility. We are also working on the PubMed data… We have meetings with SHERPA/FACT planned in August, we have meetings with CrossRef, we have been engaging with publishers and looking at the best way to do this across the board. We have spoken to PLoS, and also the JACS group that includes publishers like Ubiquity, some of the new publishers.
Looking at our very basic UI we can search for articles, we can filter by country of publication, by license, by subject classification, by Language Code of Content, by Provider, by Deposit Policy. Of course the Jisc Publications Router is work we will build on in terms of affiliation of publication. So this is a baseline for published stuff. This is material that is published, with declared licenses. But can material in press be included, can we get data from publishers at the point of submission. That would be useful but there are sensitivities there… not all academics want to advertise where and when they are submitting, and when their work is rejected. But if we could capture metadata/identifier very early on that would be very useful. So that’s a challenge for us.
So, how can you tell us what you think of what we are doing? We have online sessions on Wednesdays from 10am to 11am on 6 August, 20 August, 3 September. If you wish to contact us, email Frank Manista (firstname.lastname@example.org – though that email address may change, contact Brian if that one doesn’t work). We also have a face to face workshop in London on 19 September, and that will look forward to future work on charging, APCs, etc. (http://bit.ly/jisc-monitor-workshop-2).
But you don’t have to wait until then… You can participate now! We’d like you to think about several questions today:
- What are the key local systems for us to interact with? – what systems are you using, are relevant, what is your local set up?
- What data do these systems store?
- What data formats/data models are used?
So we will ask you to break into groups, take a sheet of paper each, and in that group come up with an area of data you think you have – could be publications data, charging data, whatever – and then write down everything involved in terms of systems, data formats or data models… whatever is relevant to you.
We also want to ask you:
- Monitoring of institutional mandates etc – should we be doing this? And what would we need to do?
- What kinds of institutional “mandates” are in place? (How do these differ from other mandates?)
- What kind of compliance and monitoring already takes place? Is this for Jisc Monitor to do?
And do email me: email@example.com with any thoughts on:
- Data sources we should consider for article data – what do you use already? What do you need?
- Data sources we should consider for licensing/terms data
- What data is relevant as regarding licensing/terms?
Q1: Is there a presumption that local institutions will make their systems and data available to the Jisc Monitor?
A1: Not necessarily. It may be about pushing out rather than pulling out. But it may also be about making recommendations about how you would connect payment data to finance systems – rather than a requirement or expectation of that being made. As well as discussions with publishers about how they could support that. It’s not all about direct engagement with data etc. It’s about getting that picture as far as we can at the moment, so we can act on that.
Owen: OK, so now the difficult bit… I am going to suggest that you work in groups of six, go towards the back of the room, to discuss those three first questions on your local systems, data used by those systems, and data formats and data models that they use.
Feedback from groups
Chris, Hull from the large group: We talked about making a link between the funder and the publication. Not an easy thing to capture. People felt that they have both funder and publication data, but separately. But capturing that data and its relationship in a workflow would be useful. But some data may sit with publisher… Research Fish(?) has that information, useful to institutions, but issue that it covers many but not all funders right now.
There was a desire to… recognise this as a problem. That we have to get academics to make that link as they are the best placed to do that.
Owen: And is there variety of what is done there?
Chris: Not sure we got that far but seems to be Ad Hoc.
Rapporteur, group at left back: Publications data is all over the place. In repositories – DSpace and Hydra; CRIS systems – PURE and in-house data; we have institutional web pages that sometimes come from repositories, sometimes from schools or individuals. Some use Primo as a library catalogue – which picks up publications like theses. In some repositories we have articles, exhibitions, etheses, OERs, RDM metadata, research data sets in some places. In CRIS we have grant agreements, funder information, funder terms (sometimes patchy). Website content sits in several formats and locations, and sometimes rogue individuals maintain their own papers. And we had formats including CERIF.
Owen: Did any of you use publications tools like Symplectic?
Owen: OK, next group…
George, Strathclyde: We were looking at local systems. We have really wide varieties of systems or data formats. Some use Jisc APC service, some register APC payments on a CRIS – my own institution uses 4 systems. Some have bespoke systems. And other stakeholders – e.g. research office – duplicate some work and systems. But we really talked about APC process, really Ad Hoc approaches taken. Jisc APC process simple in theory… but we aren’t sure of status of that service. A while back there weren’t enough publishers to make it a useful service for institutions – that’s why my institution uses so many systems. And there were concerns about compliance. And a need for greater buy in from publishers for standard systems.
Pablo, final group: We were looking at funder information as well, trying to answer the question – how to get funder information at an earlier stage of publication. Our group had PURE, EPrints, several content management systems – some being upgraded at the moment. Highly fractured picture in terms of systems and processes. All people in the group used spreadsheets – either spreadsheets or Google Docs. And these are not standardised. There were also grant holding problems. Institutional grant codes are not always the same as those provided by funders. And also a lack of communication between research officers, repository managers and librarians. And we then moved into motivation. What is the incentive for researchers here? Seems to be publishers, the area the information will finally come from. But we wondered about information managers. Submitting data at publication time. And a role for funders to offer training and best practice information to code information. In terms of article submission time… too many of them… a common feature across different manuscript submission systems which is ORCID… might be a way in there…
Owen: That’s great. We have to conclude now. I would say that issue of capturing information at the point of submission keeps arising. Thank you so much for your participation today. Do follow the project – we will be blogging, and we encourage you to be part of our webinars and workshops.
I work at Evidence Base and was previously a subject librarian. I’m going to talk about the IRUS-UK project. This was funded by Jisc and the role of Evidence Base in this project was in user engagement and evaluation. This project was an outcome of the PIRUS2 project, which looked at whether it was possible to combine publisher and repository usage statistics. But at the time publishers were reluctant to do that. So there was a desire for repositories to work together to combine usage statistics, using the COUNTER standard, in a way that would later be combinable with publisher statistics when they were more ready to come onboard.
We sit in the wider Jisc landscape, under usage statistics, and are doing so in a way interoperable with those other Jisc projects. For IRUS-UK our aims and objectives were to collect raw usage data from UK IRs for all item types within repositories – not just articles, and about downloads not record views. To process that data into COUNTER-compliant statistics. And to return those statistics back to the repository owners.
We considered two scenarios to gather data. A Push “tracker” code, or a Pull technique using OAI-PMH harvesting. We decided to go for a Push method as it would be easier and would minimise the data pushed. And we created patches for DSpace, plugins for EPrints, and last month we welcomed our first Fedora user. The ingest process is thoroughly documented on the IRUS website. The key aspect is that we applied a code of practice to filter out robots and double clicks, and we’ve also tried to remove noise including user agents, overactive IPs etc. But that’s a minimum level. We commissioned “Information Power” to investigate and report on that process. They analysed raw data since July 2012. They found that suspicious behaviour can’t necessarily be judged on the basis of one day’s usage, and that it can be almost impossible to distinguish non-genuine activity. So we have amended our processes to improve ingest as a result of these comments.
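To give a feel for the kind of filtering described here, the following is a rough sketch of dropping robot traffic and double clicks from a list of raw download events. It is not the IRUS-UK ingest code: the `Download` record, the robot markers and the 30 second window are all assumptions chosen for illustration, and the real process applies the COUNTER code of practice plus the extra exclusions mentioned above.

```python
from dataclasses import dataclass

# Hypothetical markers; the real COUNTER robots list is far longer.
ROBOT_MARKERS = ("bot", "crawler", "spider")
# Assumed window in seconds for treating a repeat request as a double click.
DOUBLE_CLICK_WINDOW = 30


@dataclass(frozen=True)
class Download:
    timestamp: float   # seconds since some fixed epoch
    ip: str
    user_agent: str
    item_id: str


def filter_downloads(events: list[Download],
                     window: float = DOUBLE_CLICK_WINDOW) -> list[Download]:
    """Return the events that survive robot and double-click filtering."""
    last_seen: dict[tuple[str, str, str], float] = {}
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        agent = event.user_agent.lower()
        if any(marker in agent for marker in ROBOT_MARKERS):
            continue  # drop self-identified robots
        key = (event.ip, event.user_agent, event.item_id)
        previous = last_seen.get(key)
        last_seen[key] = event.timestamp
        if previous is not None and event.timestamp - previous <= window:
            continue  # same user, same item, inside the window: double click
        kept.append(event)
    return kept
```

This simplified version keeps the first request of a burst and compares each request with the one immediately before it. It only catches robots that identify themselves, which is exactly why overactive IPs and unusual patterns over several days need separate treatment.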
So, here is the interface. You access this through Shibboleth or Open Access. We can look at totals to date, we can drill down into the participating repositories’ data. The total numbers have to be understood as reflecting length of time in IRUS, not just total number of downloads. We can also drill down the repository ItemType – we map repository types to IRUS ItemTypes. But we retain that metadata so if the mapping changes we can do that in the future and retrospectively.
Another statistic we can look at is DOI Summary statistics – how many of the items that have been downloaded, by item type, have a DOI. We can also look at Article DOIs by repository – a great report for repository managers to look at, to see the coverage of DOIs for repository items. You can also do a search of IRUS-UK by title, author, keyword. This allows you to see which repository or repositories the article is in, and the number of downloads. This could be particularly useful for researchers to track usage, say after a talk. Because so many people are interested in the ingest process and filtering process, we also include Ingest Summary Statistics. We record the repository, the RawDataIn count, the COUNTER robots – what they choose to include, IRUS-UK Exclusions measure those removed. And then you can see DoubleClicks and FilteredOut totals for each repository.
Now onto reports specific to a particular repository, which as a repository manager you are likely to be most interested in. We are working with our community advisory group as we want to be very user led, we are reviewing our reports. Right now they are called Irus Report 1, 2 or 3, but those names may change. The functionality will stay though. So, IR1 report can be used for Sconul reporting, looks at article downloads. IR2 report is about item types, performance of different types. And we have the ETD1 report, we did quite a lot of work to capture usage of dissertations and theses. You can also look at item type. We are currently working on EThOS integration, this will combine EThOS and repository data, to allow tracking of usage and downloads. We are working on this with the British Library at the moment, but this is the same system/approach we could use with publishers.
Repository Report 1 gives you month by month downloads by repository, to enable benchmarking. Similarly you can do this by article type (Repository Report 2) or by Jisc Band (Repository Report 3), or by country/region (Repository Report 4). We also have a proof of concept, CAR1 Report, for consolidated article report – where repository and publisher data can be combined. At the demonstration stage (only) at the moment.
So that’s the reporting within IRUS-UK. But this project is all about community. There are a growing number of repositories participating in IRUS-UK – currently 64 repositories. We communicate through the IRUS-UK Mailing List; @IRUSNEWS Twitter account; IRUS-UK newsletter – you can subscribe or use online; and IRUS-UK webinars. We gather feedback from participating repositories via surveys and conversations. And we have a Community Advisory Group who provide feedback to the IRUS-UK project, they try new reports, we meet with them (virtually) regularly.
We conducted an IRUS-UK user survey in 2014. 68% reported that IRUS-UK has improved statistical reporting; 66% reported that IRUS-UK saves time collecting statistics; even those who used other statistics felt that IRUS-UK had enabled new or improved reporting. 86% reported using it for benchmarking, the rest weren’t sure if they would – no one said they did not intend to use it in benchmarking. The respondents liked that it is COUNTER compliant, that it is easy to set up and use.
So, if you are not currently a member, please do get in touch and we will let you know if we currently support your software, and what the next stage would be. You can contact us via firstname.lastname@example.org or take a look at our website: http://irus.mimas.ac.uk/
Q1: We have a CRIS system and web portal. Can you add this system to those tools?
A1: We allow it, but we don’t currently have plugins for CRIS systems. Some institutions have mirrored repositories, every set up is different really. If your CRIS system integrates with EPrints, DSpace or Fedora then you can use it, but if it doesn’t we can’t yet support you. But we are investigating options.
Q2: You mentioned Google Analytics. I cache a lot of information overnight… I want to understand the user journey around the repository, the website, and back and forth. Has anyone yet looked at using IRUS-UK stats to improve the user experience?
A2: Not yet, but that would be really interesting.
Q3: Something that came in at Claire’s DSpace session yesterday is that we need to get DSpace into the codebase – it’s a patch at present but it’s buggy. So is this sustainable?
A3: We do try to work with DSpace and EPrints to make sure we are keeping up with new versions, but there is a bit of delay there sometimes as we do work independently.
Q4: We run a data depository… I think you measure one item per download… but for some items we have a number of different files.
A4: Right now we register each download, so we would record multiple downloads for that item. We also have a similar issue for versions… so if there are pre or post publication versions… it can be in separate records, can be a replacement of the file… so many approaches… So we are looking at how to deal with that best, what people would want in their statistics.
Q4: I think that means it would take a while for us to use for benchmarking…
A4: Yes, well right now we are focused on institutional repositories, publication outputs. For teaching and learning repositories there is very different usage, for data repositories there are also very different usage patterns. They are currently out of scope for IRUS-UK at present because of that issue with usage styles and, therefore, benchmarking.
I’m going to be presenting on another of the Open Access Pathfinder projects. You will see some similarities between our project and others that presented yesterday. We’ve also been up and running for only two months now so much of this will be about what we plan to do.
The O2OA project aims to establish shared institutional processes for facilitating, promoting and managing open access. Very much driven by HEFCE and RCUK requirements. We are a consortium of three post-92 “modern” universities – Coventry, De Montfort, and University of Northampton. We are all at different stages in implementing OA policy and services. And we need to embed OA in our existing workflow, without additional resources.
So Coventry are at an early stage in implementation of an OA culture across the institution. Host repository CURVE, using EQUELLA. They have specialist expertise in research impact management – and ran a previous Jisc project. They are taking the lead on research impact management. De Montfort University, Leicester have an existing repository and CRIS, they are leading on CRIS. And at Northampton we have a digital repository, NECTAR, using EPrints. We have focused on research data management recently and are currently developing policy and process for OA publishing so will take project lead on OA to publications. We have an internal project partner – the Institute of Health and Wellbeing – because I wanted them to have a stake in the work, for there to be genuine researcher voices in the project. So, our team overall brings together many perspectives which should be really interesting.
There are some unique features for O2OA: we are all midlands based and all post-92, we include business development link to the OA agenda and applied research. We have a real focus on impact. And this is very much a non technical policy. In terms of benefits for the sector, we will be providing a consolidated review of the OA needs of academics, information managers, research support staff, corporate leads and external funders. We will provide an understanding of perceived and actual relationships between OA publications, OA data and impact. A translation of OA needs into associated workflows, and we should inform on methods to adapt repository systems and address interoperability issues. We will be creating case studies and really engaging with what is needed for behaviour change – one of our participants has a psychology background and experience in behaviour change best practice that will be part of this.
So, to date we had a presentation to the Jisc Open Access Good Practice Workshop (17th June). We have a project plan. We have a project meeting at Coventry. We are working on focus groups, a survey is under discussion and a review of OA guidelines. And that’s where we are right now. And thank you to our funders: Jisc, SCONUL, UK ARMA, RLUK.
Q1: A question about your psychologist: is she an academic or was this about using domain expertise of library staff?
A1: She is an academic, by training and by bent. So she is bringing her theoretical and academic knowledge here. She has blogged about her work, she talks about planned behaviour as an approach for instance. Understanding people’s intentions and actions will be really valuable for us.
HHuLO Access – Hull, Huddersfield and Lincoln explore open access good practice, Chris Awre, University of Hull, Informatics Forum, G.07
Imagine yourself on a Hawaiian island…
We are a project which, like Miggie’s project, has a regional dimension. But there is a common thing that brought us together: a desire to find out how open access could support research development, in terms of research as an organisational strategy. Each of our institutions does research, at different levels, with different approaches. But the dissemination of research seems to have little connection back to organisational research strategy. That’s partly as dissemination is often down to the individual to do. But funders, policies, talk about dissemination and impact having a role in research strategy. So we are looking at this from a research strategy perspective. So the aim is to engage with researchers in our institutions, and we want to engage with research policy process, to embed OA, where it isn’t already embedded. There is recognition of the value and necessity of open access in institutional policy but… it feels like someone told us to do it, or “the library was very insistent and made its case well”, but less clear that there is an understanding of strategic benefits.
So we need to establish a baseline, and see what changes, what impact our project can have. There is another project in the programme in Northampton doing similar so we are keen to share experience and present a more combined message in terms of dissemination out of the (pathfinder) projects.
We have 6 objectives:
- To establish a baseline starting point
- To communicate the policy landscape internally and understand local research strategy/policy – and not just throwing policies at our researchers, helping them understand why those policies are there and matter.
- In that context, to review and define options for open access service development
- To enhance local systems to serve OA needs and embed external services – and part of that Jisc Monitor discussion seems useful.
- To monitor the relationship between OA and research developments within the institutions
- To report and reflect work to the community
We are being quite technical in our approach. Two of our participants use EPrints – one hosted, one locally hosted – whilst we at Hull use Hydra with Open Fedora. We need to make sure all are fit to work with those elements that are relevant, including Jisc Monitor. A significant part of this project is about getting to develop relationships and better understand our different research environments, to get to know our open access environments better. To build links. And we will be working with Jisc Monitor, IRUS-UK and the British Library on license rights. The BL has its own issues around theses – many institutions do not express the license for theses and they are keen to find a standard way to get them to express that, so we will be looking at how that could be facilitated (but recognising existing Jisc Monitor work there too).
Those objectives are effectively the work packages. Hull leads 1 and 5, Lincoln leads 2, Huddersfield leads 3 and combinations of us lead the other workpackages. We have a website and twitter account so do engage with us and track the project there. So, from now until September we are undertaking our baseline and planning work, from now until December we will be investigating and collaborating with services outside universities with relevance to open access, and looking at OA services that best meet research and policy needs. So the idea is we get up and running in the first year. Then, in April 2015 we want to run an event reflecting our work back to you, the community, and to see what else would be useful, what other areas we should be looking at in the second year of the project. This project just brings together specific innovation projects at each institution. We will be monitoring and communicating policy as it evolves – although a pause in that evolution would be useful, to give us time to reflect.
So we hope to demonstrate the link between OA and research strategy and development, and the development of tools and services that facilitate these aspects.
Q1: Can you share some of your ideas about OA and research and impact, what indicators will you be looking for?
A1: Quite often research planning takes place. Research strategy – and institutional KPIs – really focus on increasing research income. So you take that statement and you turn it into staffing, resource etc. We have an office geared up to supporting that KPI. In many ways it strikes me that that in itself has the potential to increase the research development of the place, as a place to do research. You can fund as much research as you like, but without understanding and exploiting the impact of what comes out of that research in your strategy, it’s not necessarily useful. If we can show that link between research outputs and dissemination with OA as part of it, and how that can feed into research planning that will be really useful. It’s about seeing the institution as a place to do research, not necessarily as a place to attract research income.
Preparing for the UK Research Data Registry and Discovery Service, Alex Ball, University of Bath, Informatics Forum, G.07
I am part of the team developing the UK Research Data Registry and Discovery Service, here at the DCC, on behalf of Jisc. It’s not developed yet but I’ll talk about what we hope to achieve, what has been done so far, and what you, as repositories, can do to prepare for it. Most of us on the team are from the DCC, but we also have participation from the UK Data Archive.
So our service is a bit of a Ronseal – it does what’s on the tin! So, it’s about UK Research Data. Not about e-prints or OER, or Universities’ administrative data. It’s about collections of evidence that underlie written scholarly outputs. If we are interested in those other aspects, it’s only in relation to that central form of data. For some institutions there will be data archives, for some we will be needing to extract the research data aspect from the larger repository.
It is worth noting that it is a Discovery Service for UK Research Data – we won’t hold all of that data, we will be about discovering it. We can’t require researchers to inspect all of the data repositories individually, so it will remove that barrier. And our hope is for data sharing and data reuse – with appropriate credit for researchers, better impact and value for the funder, and a better record and evidence basis for that research.
Now, we are not the first people to think of this. But none of these have the same focus we do. We think we will slot in and complement the existing landscape. We will be collecting data from repositories, unifying, aggregating, and eventually we’d like to make those records available in other places people may look, including search engines, international aggregators, other UK registries – including RCUK Research Gateway.
We have now completed phase 1, we built a registry based on ORCA software – used in Research Data Australia. We knew it was working well there and we had experience of making that software more portable. We think it works well as it’s search engine friendly – works well with Google for instance – and provides citation and rights information up front, promoting the idea of research data being first class scholarly output. For this pilot phase we needed volunteers to help us understand their requirements. We had 9 universities engaged, alongside UKDA, Archaeology Data Service, 7 NERC Data Centres (via catalogue service).
The ORCA software is based around the Australian RIF-CS format, so we created Crosswalks for standards already in use by those repositories: DataCite, DDI Codebook, EPrints (native and ReCollect), MODS, OAI-PMH Dublin Core, UK Gemini 2 (used by NERC data catalogue service in NERC, data.gov.uk, compatible with ISO standard used internationally in the EU INSPIRE directive).
So, we created a pilot. It sits at http://rdrds.cloudapp.net/. Now before you go there, be aware that you won’t see much… the records automatically sit in a holding zone until they are made public – and I’m not sure any of the pilot data has been made public yet. But if we look at an example we see the metadata, a sample citation, identifiers, additional metadata. This metadata schema here doesn’t include everything so that link to further metadata in the source record is essential for researchers needing more information on e.g. use. Access rights are shown, records are connected to people, you can look at subject terms – and navigate to similar content through those. And you can also see Suggested links – both within the registry, and outside (in DataCite). Spatial coverage and tags are also shown. So it’s a start…
But we need to clarify use cases and workflows. We had a good number of use cases but we’ve had a recent workshop we are still looking back at. We also want to compare different possible platforms for the service and assess their suitability – we want the best possible system. We want to establish a working instance of the system, involving all UK data centres and university data repositories if we can. We also want to establish a simple workflow for adding new data sources and adapt to changes in the existing data sources to avoid duplication – whether merging or ranking/preferring one record over another. We also need to test usability – for end users but also for local administrators. Finally some important documentation aspects – recommendations for quality and standardisation of metadata records. And we need to evaluate the costs and benefits of the system.
So we have a long way to go, and two years to get there. We are already being asked what this means for repositories. We think it comes down to metadata, syndication and participation – please do get involved in this participation phase.
In terms of metadata our pilot work suggests we need fields: title; description/abstract; dataset identifier – for de-duplication management so needs to be persistent and global if it can be; subject – for that click through functionality and recommendations, we found many repositories using RCUK classification; URL of landing page – a discovery service so you need access to additional metadata and the data itself, but should be derivable from dataset identifier; Creator (+ID) – some issues in the pilot phase around consistency of format so ID would be particularly useful too, especially for avoiding duplication or fracturing of records; release date; rights information; spatial coverage – particularly important and needs to be a nice structured format; temporal coverage; publisher. Many of these elements are required by DataCite and include all the elements needed there – so a way to kill two birds with one stone.
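To make that field list concrete, here is a small sketch of a record structure and a completeness check. The field names and types are my own choices for illustration; they are not the registry's schema or the DataCite schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DatasetRecord:
    """Illustrative record holding the fields listed in the talk."""
    title: str = ""
    description: str = ""
    identifier: str = ""          # ideally persistent and global
    subjects: List[str] = field(default_factory=list)
    landing_page: str = ""
    creators: List[str] = field(default_factory=list)
    release_date: str = ""
    rights: str = ""
    spatial_coverage: str = ""
    temporal_coverage: str = ""
    publisher: str = ""


def missing_fields(record):
    """List the names of fields that are empty in a record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A repository manager could run a check like `missing_fields` over an export to see which of these elements their dataset records tend to lack before contributing them.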
In terms of syndication we want to do this through OAI-PMH, CSW, Atom/RSS – useful for Sword perhaps; Other XML export over HTTP. Whatever we do we need to separate data sets from the other stuff you have. If you have a separate data repository that will be easy. But if not then we need the type “dataset”, or we can draw on specific sets or collections (actually what we do with DataCite). And the more metadata detail, the better.
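For repositories exposing OAI-PMH, separating data sets from everything else might look something like the sketch below. `ListRecords`, `metadataPrefix` and `set` are standard OAI-PMH request parameters, but the base URL, the set name and the record layout are hypothetical, and this is not the registry's harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def only_datasets(records):
    """Keep harvested records whose type values include 'dataset'.

    Each record is assumed to be a dict with a list of strings under 'type'.
    """
    return [
        record for record in records
        if "dataset" in [value.lower() for value in record.get("type", [])]
    ]
```

The first function covers the case where a repository groups its data in a set or collection; the second covers the case where data sets are only distinguishable by their type value.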
There are several ways to get involved. You can let us know about issues that concern you – in phase one a contributor was keen to stress the importance of author names in the citation – we have therefore ensured that we have encoded that information in the XML, to preserve that level of detail. If you have data records do let us know so that we can have a look, try to include them. But we’d like you to do more – test the evolving system for us by setting up and updating an account on the system; harvest your metadata into the system and check it; see if we handle duplicates (and non duplicated) correctly; see how your records look on the system; see how easy they are to find; and measure the visibility of your datasets before and after inclusion.
Do get involved, and do track project progress at: http://www.dcc.ac.uk/projects/research-data-registry-pilot/.
Q1: Is there a way to see the software used to create the metadata. And I was also wondering about the spatial data issue and shared equipment.
A1: Hasn’t occurred so far but I can see the shared equipment data raising that use case.
Q2: Are you looking to expand the metadata that you have?
A2: We are right at the beginning so still deciding some aspects. ORCA has a set number of fields. We could ask for additional fields to be added. We can make relationships to generalised objects. The metadata scheme you mentioned, we did look at. And in another system we could adopt the DataCite schema. There are lots of possibilities.
Comment, Kevin: It’s worth emphasising that one of our main guiding principles is to require as little additional work as possible from participants, unless it delivers some use for them. So whilst we can think of lots of metadata that might help, right now we don’t have evidence of how that metadata might benefit discovery. We want best value from what metadata exists, and only then think of what more could be added.
Q3: On the issue of metadata, the first thing for a service is to see what metadata people need now, rather than clutter it up with more. But at the same time I wonder if including what funder is behind the work is a thing that you have thought about as a metadata field to include. As well as owner of that data – and reuse around that. Because being able to search across data repositories as a funder would be very useful.
A3: On the point of owners, a point of contact for the data might be less controversial and fits the ISO standard we are working with. We did consider funder in phase 1. The issue we had with the RIF-CS schema was around that identifier: there, if you have a funder, you need a project identifier. The data we ingested sometimes had both elements of data, but not all had a project identifier. There are ways to overcome that though, as they have in Australia.
I’d like to talk about ArchivesSpace, our new Archive management tool. And firstly to tell you a bit about archives at Edinburgh. They sit within the Centre for Research Collections. We are categorised under Special Collections and Archives. We have a wide and varied selection of archives – the Laing Archive – who had a weird and vast archive!; ECA archive – a highlight being the contract and life portrait of a young Sean Connery; Carmichael Watson archive, Roslin Archive; NHS Lothian Archive; Godfrey Thompson Psychology Archive.
So the current management uses the ISAD(G) standard, the EAD(XML) Schema. These are as laid down by the GASHE and NAHSTE projects. As happens so often the commercial systems don’t always do what you want, so you end up with bespoke options. Recently we have had a system called CMSyst, which was built by Grant Butters, an archivist developers hate – great on both the archives and the technology! The system is robust in terms of access and authority control but built with MySQL/PHP. And we also have data feeds to Archives Hub, ARCHON, CW Site (EDINA).
We had a new archivist coming in, so we wanted to pick a system. DSpace was an option, but not all that great. Vernon was great for clear objects, but there was no way it could cope with boxes of stuff, collections within collections. We looked at two commercial systems, Calm and Adlib – now merged – but they were not good enough to justify the cost. Then we looked at Archivists’ Toolkit – brilliant… but no front end! ICA Atom looked good, but nothing like the functionality we needed. And so that took us to ArchivesSpace – which has the right mix for us.
Now I was being a bit dismissive about Archivists’ Toolkit, it was the predecessor to ArchivesSpace and trialled extensively, structured to work with EAD, ISAD(G) standards, MySQL database, Good authorities control. But no front end – which meant we needed to wait or build one. Hence ArchivesSpace.
ArchivesSpace has all archivist functionality in one place, it’s web delivered, it’s Open Source, and built by the community – Lyrasis. It is a MySQL database, running under Web Server (often Jetty, we use Tomcat); Code developed in JRuby, available through GitHub, and has four web apps for search, management etc. We had all the data from CMSyst and we exported in EAD. There were a few issues but more or less there now. We also have the functionality in ArchivesSpace to link out to illustrative digital objects. We already have the Luna system for a number of more random items, often separate from collections. So we are and will be making that link.
The system is really popular in America. We are the first European institution to take it. There are Charter Members (54) – they are the executive. And there are 100+ General Members who can contribute – including Auckland and Hong Kong now. So significant and large are the archives using the system that we are almost second tier in terms of collection size, it’s a really significant uptake.
My colleague Ianthe was talking about collections.ed earlier. ArchivesSpace will effectively be the expression of archives within. It needs to be branded seamlessly through the CSS – once all set may work better in ArchivesSpace, need to surface Archives as CLDs. We have had to set up cross walks and deal with duplications. There are 1000s of collections but we think in the next few months we’ll see it all working together.
This has been a really collaborative project with curators, archivists, projects and innovations staff and digital developers all involved. Good training delivered to archivists, and there is a manual created collaboratively by the teams. And good collaboration with other institutions, and interns here.
So, next steps… colleagues will be at conferences in Madrid and Newcastle. And there is a sister application for museums called CollectionSpace as a possibility for our museums. But we wanted to mention it today because we are really happy with it. Do encourage any archivists you know who are looking for a system to get in touch with us.
You will find more via our collections portal: http://collections.ed.ac.uk/
Q1: Is this intended to run as a catalogue or, as you begin to archive digital artefacts, will it also store those, or will they be elsewhere like Luna for images.
A1: It is a flexible module but at the moment it’s focused on the catalogue/management aspects. But I know Grant Butters is thinking about digital artefacts.
A2: Everything is effectively publicly available. ?
Q3: Has there been much consideration of making metadata available through other discovery services?
A3: The Archives Hub has been talked about so far. That’s the main place for archives across Britain at the moment.
Q3: I’ve seen metadata from Archives feeding into institutional searches.
Ianthe: We have two awesome entries to the developer challenge. We will hear from “Are We There Yet” and “Repository Linter”. Then we shall cheer and see who takes what prizes!
Are We There Yetttt? – Miggie Picton, Marta Riberiro, Adam Field
Miggie: This was designed to solve a problem in multiple repositories, because of HEFCE’s requirement for authors to log a paper at submission, before it is accepted for publication. So I asked for a tool to alert us when a paper sent in for publication is actually published. So that was what I asked for!
Magda: The user is presented with a form for title, author, and an email. And when they submit… the system goes searching…
Adam: We decided that once something has a DOI, we can be confident that it has been published. So we take the metadata, send to CrossRef, and search for suitable DOIs, and seek out exact match for title and creator. If we see that, we grab date first seen, and alert the user that something was found at this DOI. Then librarian’s responsibility to check that. Obviously this system would benefit from more metadata but we’ve kept it simple…
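The matching step Adam describes could be sketched roughly like this. It is not the team's code: the candidate records stand in for whatever a CrossRef search returns, their layout is hypothetical, and the case and whitespace clean-up is an assumption of mine (the team described an exact match).

```python
def clean(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def find_published(submitted, candidates):
    """Return the DOIs of candidates matching the submitted title and author.

    'submitted' has 'title' and 'author'; each candidate has 'doi', 'title'
    and a list of 'authors'. Both layouts are assumptions for this sketch.
    """
    title = clean(submitted["title"])
    author = clean(submitted["author"])
    matches = []
    for candidate in candidates:
        if clean(candidate["title"]) != title:
            continue  # the title must match in full
        if author in [clean(name) for name in candidate["authors"]]:
            matches.append(candidate["doi"])
    return matches
```

Returning a list rather than a single DOI fits the point made below that a librarian still checks the result and that more than one DOI can be correct.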
Magda: So when results are in you can click through to view the record. You can then mark that DOI as correct, or as incorrect. If it is correct the date is logged. You can have more than one correct DOI.
Adam: The other feature is that, whilst you are waiting to find the paper you get the message “we’ll get there when we get there!!!”
Richard: We took a while establishing our idea. We came up with a tool to check the completeness of your repository records, and help you fill in the gaps. Through a web service and a plugin.
Paul: Part of this is a webservice for entering a record for the repository… it will spot gaps, but suggest replacements from services like SHERPA, CrossRef. To demonstrate this John? has created a demonstrator integration for EPrints.
John?: So you take a record… it sends the data to Paul’s service, looks at where the gaps in the metadata are. It suggests replacements to fill those gaps. As you confirm details it filters down suggestions so that data is gradually filled in. So, imagine you’ve been handed only a small piece of paper… no other metadata. Hmm… but there is a DOI that seems to match… let’s have a look… who are the authors, who are the funders… So we are almost there… And, in theory… if we try to add another author, new projects will be suggested based on that author. But that’s just a tiny part of what you could do for your repository – more powerful options there, imagine the possibility!
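The gap-spotting idea can be illustrated with a few lines of code. This is a sketch under my own assumptions, not the challenge entry: the field names are hypothetical and the `lookup` dict stands in for whatever an external service such as SHERPA or CrossRef might return.

```python
def find_gaps(record, expected_fields):
    """Return the expected fields that are missing or empty in a record."""
    return [name for name in expected_fields if not record.get(name)]


def suggest(record, expected_fields, lookup):
    """Offer values for the gaps only, leaving existing metadata alone.

    'lookup' is a dict of values obtained elsewhere; here it is simply
    passed in, so no external service is called.
    """
    return {
        name: lookup[name]
        for name in find_gaps(record, expected_fields)
        if lookup.get(name)
    }
```

Suggestions are returned separately rather than written into the record, which matches the demonstration where the user confirms each detail before it is filled in.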
Ianthe: We have prizes thanks to our lovely sponsor the Software Sustainability Institute, and Neil Chue Hong is going to quickly tell us a bit about what they do…
Neil: The Software Sustainability Institute want better code available for researchers. We engage with events like Repository Fringe because we think code expands beyond software engineering. A couple of reasons… we want to support people like these developers, to develop and support them. But also because we have a need to have preservation software – particularly when linking things to papers. We want to arrange a hack for one-click solutions to archive software to repositories like DSpace and EPrints.
Cue many cheers for both! But the first prize goes to Repository Linter! But prizes for both from our lovely sponsor the Software Sustainability Institute.
CLOSING REMARKS, Peter Burnhill, EDINA, Informatics Forum, G.07
EDINA are one of a triumvirate of Information Services organisations – with DCC and the University of Edinburgh – behind this event. And I’d like to start by thanking all of you for coming along, for speaking, and I’m sure we’d like to thank all of those who made today’s event possible. We are about innovation and creativity, as in the presentations we’ve just had.
So, looking over the last few days… yesterday was rather dominated by open access… that is part of the threat of the REF I suspect. But almost 50% of you are new to the Repository Fringe, there is change, our group is broadening out beyond early adopters a few years back. When other colleagues across the world talk about the repository, they are not as obsessed by OA as we are, they think beyond Gold, Green, the Finch Report… those are important for adoption but they are not everything. So to really realise the potential of repositories there is a real need to understand the importance of research information management – and I think that penny has dropped – and bounced a few times too!
We now have connections between the four separate areas – the publishers, the library and related areas of institutions, research awards and funders, and the academics and authors. I think both of those Challenges we saw this afternoon illustrated many of the challenges still here. The issue of incomplete, partial metadata and the need to get the best that you can, and select the best available data. And the first presentation was so important for illustrating what I call the “celebration of the purchase order”. So, now we have Gold, we are supposed to buy into a service – we should issue a purchase order for that service. But you don’t pay before the job is done. That first Challenge was about logging that purchase order in a way, and that DOI returned letting you pay your bill (look out for that bestseller novel: “The PI Worrier and the Bag of Gold”).
Day two has been about preservation – a flood of tweets there – and about research data and the idea of special collections: these other aspects of repositories and concerns for the long term. If we look back to OR2014 you can see that these ideas are definitely knocking about more widely.
Anyway, what we do is about ease and continuity of access. Repositories support that endeavour. Both elements are important. And it’s not just the PDF of the research article… It is the aim to ensure that RepoFringe continues to celebrate and support the developer but always with an eye on policy-purpose.
Big thanks to all who came from near and far, to all the organisers for today.
Martin Donnelly, DCC: So that’s it for this year. We want this to be a great event, and to make next year a great event we need your feedback, your comments, your ideas on how to make Repository Fringe 2015 even better! So email us, comment on the blog, tweet us! And finally I’d like to particularly thank Dominic Tate (UoE), Laura ? (UoE), and Lorna Brown (DCC) for making the last two days happen.
Thank you to all of you who came along, and to those who have been following on the blog and on Twitter. If you have any feedback for us do comment here, email us, or tweet us… but there is also an official survey that you can complete here. And keep an eye on the blog for follow-up posts, links to your posts about the event, and we’ll try and add our pictures into these liveblogs too!