Dr Alice Motes, Senior Data Librarian, University of Bath
Managing digital files is inevitable in research. At some point everyone has some data in digital form they need to wrangle – even if that starts as pen on paper or feet on floorboards (shout out to my performing arts folks!). If we know we’ll be creating digital files over the course of our research, then we can take proactive steps to make our lives a bit easier. That’s where “research data management” comes in. It’s a term that tries to capture all the practices we engage in across the research lifecycle from planning to preservation. It’s the holistic soup to nuts approach to research projects. You’ll most likely hear the phrase “research data management” from libraries, research support, university leadership, and funders. Sometimes it gets used loosely to mean “sharing and preserving data” when really that’s only part of the equation – first you need well-managed and documented data! It’s often closely tied to ideas around open research/open science and reproducibility, which many universities and funders have policies about – for example, here’s University of Surrey’s. (I previously worked at Surrey so many of the links and resources mentioned here are from Surrey; other institutions will have their equivalents; check out the forum for some additional local resources and to share your own knowledge with the wider programme.)
A lot of the advice on good data management practices should be driven by what methods and technology you’re using to collect, document, and analyse your data. We’ll go through some top-level strategies, but you’ll want to investigate what is commonly accepted practice in your field to get more specific advice.
Now, let’s talk about some ways we can better manage our digital files, keep them safe, and share them!
Make a Plan and Get Organised
Data Management Plans
One way to tackle research projects is through a structured semi-formal plan. Data Management Plans (DMPs) aim to provide a rough roadmap for collecting, analysing, storing, sharing, and preserving the data in a research project. Funders often require them as part of bids. Some universities expect every research project to have one. DMPs can be as brief or detailed as necessary. They’re considered living documents and can be revisited. DMPOnline provides step-by-step guidance for filling out data management plans. While their focus is on templates for different funding bodies, you can also select their generic template to create a DMP for your project. Sitting down and thinking through each stage of your project, what requirements you might have, and what your plan is for sharing and preservation will make achieving those goals a lot easier. We’re going to cover some strategies and tools below that hopefully will make your data planning dreams come true and could be included in a plan. University of Otago has a Data Management Planning Tool, too.
An easy strategy for keeping things organized and findable is coming up with a simple file naming convention. The goal is to be descriptive enough to know what every file contains without opening it but brief enough that you can quickly read the name and it won’t cause trouble for some programs that don’t like long file names. You’ll want to think ahead to your analysis – what file names make the most sense for your project? Do you want your files to be organized by date, project site, test solution, material, etc? How do you want your files sorting in your folders? You can read more about good practice here. The key to a good file naming convention – or really any data management practice – is finding one that works for you. You can construct an elaborate file naming convention, but if you don’t remember it and you don’t use it, then it doesn’t work!
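To make the idea concrete, here’s a minimal Python sketch of one possible naming convention. The pattern (project, site, ISO date, two-digit version) and the function name make_filename are my own illustrative assumptions, not a standard; swap in whatever parts matter for your project.

```python
import re
from datetime import date


def make_filename(project: str, site: str, collected: date, version: int, ext: str) -> str:
    """Build a name such as 'soil-study_bath_2024-03-05_v01.csv'.

    The parts and their order are an illustrative assumption, not a standard.
    """
    def clean(part: str) -> str:
        # Lower-case the text and turn runs of anything other than
        # letters or digits into a single hyphen.
        return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as text, and a
    # zero-padded version number keeps v02 ahead of v10 in a folder listing.
    return (
        f"{clean(project)}_{clean(site)}_{collected.isoformat()}"
        f"_v{version:02d}.{ext.lstrip('.')}"
    )
```

Calling make_filename("Soil Study", "Bath", date(2024, 3, 5), 1, ".csv") gives soil-study_bath_2024-03-05_v01.csv. If you’d rather have your folders sort by date first, move the date to the front of the pattern.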
Figured out how you want to approach file naming? Great! If you’ve already got lots of files, or you end up with a lot of semi-nonsensical file names during data collection (I’m looking at you, proprietary laboratory instrument), then you can use a bulk renaming tool to quickly rename large numbers of files. Use caution though! You don’t want to accidentally lose important information when you bulk rename files. If you can, it’s worth creating a back-up before bulk renaming the files. Once you’ve verified your bulk rename has been successful you can delete the back-up. Some recommended applications:
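If you’re comfortable with a little scripting, a bulk rename can also be done in a few lines of Python. This is only a sketch under some assumptions of mine: the files to rename share a common prefix, the function name bulk_rename is hypothetical, and the back-up is simply a copy of the folder placed next to the original with "_backup" added to its name. It defaults to a dry run so you can check the plan before anything changes.

```python
import shutil
from pathlib import Path


def bulk_rename(folder, old_prefix: str, new_prefix: str, dry_run: bool = True):
    """Swap old_prefix for new_prefix on every matching file in folder.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    (the default) nothing on disk is touched.
    """
    folder = Path(folder).resolve()
    planned = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name.startswith(old_prefix):
            new_name = new_prefix + path.name[len(old_prefix):]
            planned.append((path.name, new_name))

    # Stop before changing anything if a rename would overwrite a file.
    for _, new_name in planned:
        if (folder / new_name).exists():
            raise FileExistsError(f"{new_name} already exists in {folder}")

    if dry_run:
        return planned

    # Back up the whole folder first (assumed location: a sibling folder).
    # copytree raises an error if that back-up folder already exists.
    shutil.copytree(folder, folder.with_name(folder.name + "_backup"))
    for old_name, new_name in planned:
        (folder / old_name).rename(folder / new_name)
    return planned
```

Run it once with the default dry run, read through the returned pairs, and only then run it again with dry_run=False. Delete the back-up folder once you’re happy with the result.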
In academia we’re constantly tweaking, adjusting, and revising. Keeping track of what is the most recent version of the data, code, or paper can be a headache. Add collaborators to the mix and you’ve got even more ways for versions to get out of hand. Luckily, there are some ways you can avoid the major pitfalls of working through multiple versions.
Use built-in functionality – the software you’re using may already have version control measures. Find out and use it!
A simple manual version control can be achieved using numbers. Whole numbers (1, 2, 3) can be used for major changes and decimals for minor changes (1.1, 2.1, 3.2). For collaborative documents you can also include a log at the beginning of the file to record who and what changes have been made to the document and when. You can see an example here.
For software and code development, there’s version control software that can keep track of the work. GitHub is a popular free option aimed at open collaboration and sharing of code. Surrey also has its own GitLab instance if you need the same functionality with additional security. While GitHub was primarily designed to host code, people use its functionality for all sorts of purposes.
Not sure which Word document is most recent? You can use the built-in “Compare” option on Word to see what the differences are between two documents. Find this under “Review”, then hit “Compare” and then select the two documents to compare. Differences will appear as track changes.
Deletion – One of the best ways to keep versions clear is by deleting files. If you want to keep some earlier versions of work, then only keep major changes and delete any minor change versions. Deleting unnecessary files is some of the best data management you can do.
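The manual numbering scheme above (whole numbers for major changes, decimals for minor ones) is easy to apply by hand, but here’s a small Python sketch in case you want a script to work out the next file name for you. It assumes, purely for illustration, that the version sits at the end of the name as "_v1" or "_v1.2" just before the file extension; bump_version is a hypothetical helper, not part of any tool mentioned here.

```python
import re

# Matches "_v1" or "_v1.2" immediately before the final file extension.
_VERSION = re.compile(r"_v(\d+)(?:\.(\d+))?(?=\.[^.]+$)")


def bump_version(filename: str, major: bool = False) -> str:
    """Return filename with its version increased by one step.

    A major change moves to the next whole number (v1.3 -> v2);
    a minor change increases the number after the point (v1 -> v1.1).
    """
    match = _VERSION.search(filename)
    if match is None:
        raise ValueError(f"no version like '_v1' or '_v1.2' found in {filename!r}")

    major_no = int(match.group(1))
    minor_no = int(match.group(2) or 0)
    if major:
        new_version = f"_v{major_no + 1}"
    else:
        new_version = f"_v{major_no}.{minor_no + 1}"
    return filename[:match.start()] + new_version + filename[match.end():]
```

For example, bump_version("report_v1.docx") gives report_v1.1.docx, and bump_version("report_v1.1.docx", major=True) gives report_v2.docx. The number after the point is a counter rather than a true decimal, so v1.9 is followed by v1.10.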
Want more? I don’t blame you! Check out this great video on data management and version control from Software Carpentry.
One last related thing: if you’ve got messy data you need to tidy up before analysis, then you should really check out OpenRefine. Oh, and try to automate everything as much as possible – put it all in your scripts/code, so you can know with certainty that the data you’re working on is exactly the same data your last analysis session was working on. Here’s an example from Biology for regularly updated data.
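One way to know “with certainty” that a data file hasn’t changed between sessions is to record a checksum and compare it the next time you start work. The sketch below uses SHA-256 from Python’s standard hashlib module; the function names and the idea of keeping the recorded checksum in your notes or script are my own assumptions about how you might set this up.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(Path(path), "rb") as handle:
        # Read in chunks so large data files don't need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_unchanged(path, recorded_checksum: str) -> bool:
    """True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

At the end of a session, record sha256_of("your_data.csv") somewhere safe. At the start of the next one, have your analysis script call data_unchanged with that value and stop if it returns False.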
Keep Your Data Safe
If you venture out into the world to do field work and the like without consistent access to the internet, you should regularly sync the working copy of your data with the master copy on University storage. Find a system that works for you: maybe every Monday morning during breakfast you sync your working copy? Or make it a part of your process: after a large chunk of collection or analysis your final step of the day will be syncing the data.
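If your university storage shows up as an ordinary folder on your computer, a one-way sync can be sketched in Python as below. Treat this as an illustration under assumptions rather than a recommendation: sync_to_master is a hypothetical helper, it only copies new or changed files from the working copy to the master copy, and it never deletes anything. A dedicated sync tool or your institution’s own client will usually be the better choice.

```python
import filecmp
import shutil
from pathlib import Path


def sync_to_master(working, master):
    """Copy new or changed files from working into master.

    Returns the relative paths that were copied. Files that exist only
    in master are left alone, so nothing is ever deleted.
    """
    working, master = Path(working), Path(master)
    copied = []
    for source in sorted(working.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(working)
        target = master / relative
        # Skip files whose contents already match the master copy.
        if target.exists() and filecmp.cmp(source, target, shallow=False):
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also keeps the modification time
        copied.append(relative.as_posix())
    return copied
```

The returned list doubles as a quick log of what changed since the last sync, which is handy to paste into your field notes.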
Surrey storage options are outlined here. If you need to move large files (up to 20GB) in or out of Surrey storage you can use Surrey’s dropoff service. When in doubt, don’t hesitate to get in touch with IT Services if you have questions or need help, especially if you have special requirements (e.g. secure bubbles, high performance computing, etc.)
Share Your Data
Making sure your data is neat, tidy, and well documented offers a huge advantage during your project and write-up, and even many years later if you need to revisit the data. It also means that someone can easily use your data even if they weren’t involved in the project. And it’s increasingly likely that other people will be looking at your data! While data sharing is a long-established tradition in some disciplines, it is becoming the norm in many others. Why? Well, people primarily share data for two purposes: 1. verification and reproducibility and 2. re-use. This gets back to ideas around open research/open science and reproducibility. There is a movement in academia towards creating a greater culture of openness and sharing as a standard practice. Communities of scholars are pushing for more transparency and improving research quality within their own fields. Large funders are recognizing research data as a valuable project output full of potential for accelerating scholarship and stimulating entrepreneurship through re-use. Two recent examples: 1) UKRI recently made data availability statements a required element of published works. 2) In the US, it’s now expected that all publicly funded publications AND research be made available without embargo or cost.
We can really see this in dramatic action during public health crises (AHEM Covid-19, I’m looking at you) when openly sharing data leads to rapid advances in understandings and applications. Universities and journals often put in place resources, people (like me!), and policies to encourage openness and reuse. All this rests upon a strong foundation of proper data management and documentation, so someone else can pick up the work where you left off. So how can you get involved? Well there’s a lot of angles to cover, but some things to consider include:
Permission/Consent – If you are using data that you don’t own or that is commercially sensitive, you’ll need agreement from the owners to share the data. If you have participants, then you will require consent to share their data. Check out the UK Data Service’s templates for how to write consent forms for data sharing. There are lots of ways in which you can participate in a culture of openness and transparency while also protecting your participants through various levels of controlled access. More on that at the end.
Licenses – You’ll want to apply a license to your data to make it clear how someone can use your data. Most data are shared under CC-0, CC-BY, or CC-BY-NC, but there are specialized licenses for software you may want to consider.
Data repositories – There are lots of purpose-built platforms for sharing research data. Some of them are focused on specific disciplines or types of data, like UK Data Service for social science data or NOMAD for materials data. Others are generalist repositories and will take any type of data, like Zenodo.org or figshare.com. Here is a non-exhaustive list of repositories. Need help identifying a repository? Ask your librarian. In fact, many universities run their own repositories, so you could check with your library first. Surrey’s repository has recently started accepting de-identified data. Email firstname.lastname@example.org for how to deposit.
Open Science Framework (OSF) – provides a platform to manage your project from beginning to end. You can pre-register your study and decide which parts of your project are made publicly available and when. For EU/UK students: When setting up your OSF project, you’ll want to select their European storage server to avoid any conflicts with GDPR.
Software Sustainability Institute – SSI has a wide range of advice on how to make your software and research more sustainable, reproducible, and shareable.
Be sure to consult with your supervisor or PI about data sharing before you release any data! They may know the best place to put it or have some good advice on best practice in the community.
The final thing to remember is that sharing your data is not an all or nothing proposition. You may have some data that can easily be disseminated, some that will require some vetting, and yet more data that can only be made available for verification purposes under a non-disclosure agreement. All of these methods are demonstrating a willingness to engage with openness and mark a different approach to how data is understood and valued in academia – as a valuable output, deserving recognition, on par with a publication. Even if you are in a position where sharing your data is not feasible, imagining a future re-user can help you better organize and document your project for yourself! Trust me, future-you will be very grateful.
A wealth of resources is online to help you better manage, share, and preserve your data. We’ve only just scratched the surface here. (I didn’t even get to talk about preservation!) You will be able to find some discipline or method specific advice – look for recent publications and check to see if your professional societies have any advice. Of course, there’s also lots of great general overviews including:
Local resources will likely include members of your IT Services team and research support offices, like Surrey’s Open Research team housed in the library. You may also have local groups offering advice and training on software or methods. For example, Surrey’s Reproducibility Society meets regularly to discuss reproducibility and has hosted sessions on GitHub and R. If not, maybe you could start one!
Swansea University’s Research Data Management team also has lots of advice and support links on their webpage and they also run regular training events.
You Got This
Hopefully, this has given you a jumping off point to explore the best practices within your own field. The challenges of enacting good data management habits are distinctly human ones (taking short cuts, relying too heavily on individual memory, choosing to save a little bit of time now that turns into a lot of time backtracking later), but these are not insurmountable. There’s lots of tricks and tools you can use to make your life easier. So, get a plan together for how you want to tackle your data and give it a go!
You may find it enlightening to compare how different fields or institutions handle data – and the number of different forms data can take.
Why not compare your own approach and experience with others in your pod? What can we learn from each other; what else is out there?
We’d also love to hear in the comments below if you have any top tips or cautionary examples to share…