Got Data???

Feb 28 2011 | Data

Actually, you've probably got too much, and you probably don't have the time to sift through it, let alone store it in an organized fashion in perpetuity.  Data management is (at least for some of us) becoming almost as cumbersome as getting the damn stuff to begin with.  It would be great if we each had a bioinformatician chained next to us at the bench, analyzing and processing the data in real time as we create it.  A few labs even do, but they are the serious outliers.  At best some of us are crunching our own numbers, but the rest of us poor saps have to hand off our data to others for this task and wait in the queue.

Next, where are we storing these vast quantities of digital data?  Are your universities and institutions setting up and running secure servers so that your data is at least protected and accessible to you, let alone the public?  Do you have a mini-server or a more complicated RAID5 setup humming away in the corner of your lab?  Or are you the poor bastard whose data exists only on the hard drive of your desktop, which is infrequently if ever backed up?

I think a huge problem is looming on the horizon, especially for data-intensive research:

-Where are we storing it?  Should we be dumping it all into government repositories, the way some folks stash their microarray data in GEO or the Gene Expression Atlas?

-Should we be using a standardized format for the data we store?  Excel, ASCII, comma- versus tab-separated values?  Ugh!  (There's a quick sketch of the plain-text options right after this list.)

-Do we do our own bioinformatics or just farm it out?  The rate-limiting factor in some of my research used to be doing the experiments; now it's data processing and analysis.  A recent survey in Science showed that only 24% of respondents had the capability to do in-lab data analysis, 34% were using collaborators, and 23% felt they did not have the necessary skills to analyze their own data.

-Do we store published data on university servers, which are more stable and better maintained than those running in many labs currently?  And who the hell is going to pay for this?  80% of those surveyed said they did not have sufficient funding for data curation, and 50% of labs were keeping data archives on computers in their own lab.  And as Dorothea has previously brought up, there is a general lack of research IT.

-What are our policies for data requests?  Should there be a standardized policy put forth by the NIH or the universities, or should we just leave it up to the individual lab?  Less than half of the labs that asked for data actually got it, and the rest only had requests for published data fulfilled "sometimes."
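
On the format question: plain-text delimited files are the easiest to future-proof, since just about anything can still read them a decade later.  Here is a minimal Python sketch (the file names and values are made up for illustration, not from any real dataset) that writes the same toy table as both comma- and tab-separated values and reads it back:

    import csv

    # A toy table standing in for real measurements (values are made up).
    rows = [
        {"sample": "wt_rep1", "gene": "Abc1", "expression": 12.4},
        {"sample": "wt_rep2", "gene": "Abc1", "expression": 11.9},
    ]
    fields = ["sample", "gene", "expression"]

    # Comma-separated values: the most widely readable plain-text option.
    with open("expression.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

    # Tab-separated values: same data, safer when fields contain commas.
    with open("expression.tsv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # Either file opens in Excel, R, or a plain text editor years from now.
    with open("expression.tsv", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            print(row["sample"], row["expression"])

Whatever you choose, the point is the same: pick something plain, document it, and stick with it.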

We've got to tame this tiger before it gets too far out of hand.

10 responses so far

  • It's a tiger that definitely needs to be tamed. You're right that this is a major issue for many labs: not just how you store data, but how you link digital data to more traditional notekeeping, how you index it, etc. It's a big hurdle for individual labs and for open science. I'll have to look up the info later, but one of the scio11 live streams I watched brought up some of these same questions, including how involved institutions' libraries and their personnel should be in archiving data. Perhaps I will soon finish my own post on this that's stuck in the queue.

  • Christina Pikas says:

    Libraries are trying to figure out how to help with this. That would be more for the completed projects though. Funding is always an issue. Have you talked to your library?

  • Miss MSE says:

    Sadly, my impression is that the biological sciences are ahead of the curve on this one. Materials science doesn't have many repositories to even consider that option, and data varies wildly from subfield to subfield. Worse, most of what we consider fundamental data is really metadata, describing the sample. The metadata can very easily dwarf the raw data when you start trying to be rigorous about keeping it with each data file. The new NSF requirements are supposed to help encourage people to think about data management, but if they are enforced, I can easily see it turning into a copy-paste paragraph for many PIs. My current university happens to be unusually aware of this, but it's still incredibly daunting.

  • brooksPhD says:

    I feel your pain, mate. I lead a group whose sole purpose in life is data management for our Institution. We even have a cunning name: "Data Management Core".

    Catchy huh?

    Right now we only have a couple of lab projects working with us, because the Director of my Unit (I'm in one of four cores within that unit) runs his own server-farm for handling everyone's genomics data. We focus primarily on HIPAA protected healthcare data.

    But...it ain't cheap. If you want to use us for anything larger than a small longitudinal trial (i.e. something that doesn't take my programmers much work to configure as a workflow in our database), then you gotta pay. We run on what I call "funding reciprocity": if you've got a grant or are submitting one, you put us down for FTE coverage, and that way I can guarantee you a bioinformaticist to curate and care for your data. Your minimum is gonna be somewhere in the region of $15-25k/yr though.

    And you also need someone to analyse it and we don't do that. So, it all adds to the bottom line.

    We just released an open source version of our DB for any group with the druthers to download and run it on a SQL server at their own shop (search for PRIME open source; it's on SourceForge).

    As long as you're running PHP 5.3 or above, it installs automatically and it's fairly intuitive. We released it under the Affero GPL v3, so anyone can have it and play with it, but you're bound to release any modifications back to the community.

    It's built for healthcare data but there's no reason why it shouldn't work for any format of data.

  • Dr 29 says:

    At my PhD lab the boss was not only insistent ... but very, ehem, anal about us archiving data. We had a RAID setup in one of our Linux boxes dedicated just to daily backups. Boss said it didn't matter whether my data was archived in 2 places, on 2 computers, or on an infinite number of tapes, as long as there was a way to access it. When some of the old PhD students left, my PhD lab boss gave me the task of archiving everybody's data, moving it from one box to the RAID one and archiving everything on tape. Since my PhD school was filthy rich for certain things and had a good chunk of bioinformatics people, it also had tape archival. So I know my data is safely stored in at least 2 places on campus, and boss can access it whenever.

    In my current lab, since the calculations and studies we do don't chew up as much space, students can usually carry the data on their laptops and desktops, but before boss signs the documents to "let them go" they're required to hand him a disk with all their files. We don't do tape archiving here, but boss makes sure there are at least 2 copies of the data in 2 different places (his computer and a disk, a combination of disks, or 2 disks, one for the boss and one for the student).

    Great post and it is of utmost importance to address this. I'd say create a national database, but ALWAYS have a local copy (or copies) of the data.

  • Monex says:

    Keeping sensitive data safe from inappropriate access and disclosure is of the utmost importance. Virginia Tech has many policies, procedures, and standards in place to protect sensitive data. It is the responsibility of everyone handling sensitive data from Virginia Tech to be familiar with these policies, procedures, and standards. It is important to find out what sensitive data you are handling and what steps are needed to protect it.
