In this series we are exploring Microsoft Excel from a data governance perspective. Many organizations use Excel heavily as the primary place where data is entered. This is understandable for younger organizations that are still trying to figure out best practices. However, Excel has many limitations that other database platforms do not, and maturing organizations should re-examine their use of Excel as a means of data entry and storage.
One of the key principles of good database design is that data should only be maintained in one place, though it can be referenced from many places. In relational databases this is enforced using tables and keys. A table will be the key store of one piece of information, and a unique identifier/key will allow other tables to reference that information without copying all the data. Excel, with its multiple worksheet structure, does not enforce well the concept of keys and storing data in one place only.
What often happens in an excel spreadsheet is that the same data is copied from one worksheet to another, or even from one spreadsheet to another. As the data is moved into different locations it is inevitably changed and then it becomes difficult to know what the "authoritative" source of data is.
The process of separating data into tables and keys is called database normalization and it is important for a number of reasons. Moving your Excel spreadsheet into a database allows you to break apart the data into logical chunks that can be maintained in one place so that you don't have to remember all the different worksheets where the data might appear.
Another reason why Excel is not a great database is that it does not enforce data types by default. In a relational database if you set a field constraint the underlying system enforces that constraint rigidly to the extent that it does not permit data to be saved that does not fit the data type of the column. In Excel you can set a data type into a column, but by default all that means is that Excel tries to massage whatever you type into that column format. In a spreadsheet with thousands of rows it's hard to find the data points that Excel might have interpreted wrongly.
There is a way to specify and enforce data type constraints on Excel fields but as this behavior is not the default behavior few people know or take the time to set these constraints.
Here is a list of some common data types and the benefits of enforcing data integrity.
Relational databases have a robust transactional system which allows people to operate only on parts of the database that are not currently being changed and to group operations together into units called transactions. These changes, once committed, can then can be propagated to everyone else working on the database in real time. This ensures that two people don't make contradictory changes to the database.
Collaboration using Microsoft Excel often means copying the spreadsheet to multiple people who each make their own changes (or even submit their own spreadsheets), and these spreadsheets are merged together into a master spreadsheet by another person. Because there is no one "authoritative" version of the spreadsheet it is easy to end up with confusion as people make changes in various places.
This might be the biggest reason of all why widespread use of Excel can be a data governance nightmare. If an Excel spreadsheet is not password protected (and most of them are not), all it takes to expose the data is for someone to obtain access to the file in question. This happens in several ways - someone can accidentally send the file to the wrong party, someone can physically or through the network get access to the computer where the spreadsheet is stored. Excel spreadsheets often find their way onto USB keys as well and these are misplaced with alarming frequency.
Relational databases such as SQL Server, Postgres, Oracle, and MySQL are somewhat easier to secure. In these cases there is only one place where the data is stored - presumably on a network or cloud server. This provides two layers of protection around the data - network security and database security.
Backups of relational databases are stored in a format where one must restore the database to a compatible version of the server before one can access the data. This means that it requires a bit of extra sophistication to get access to the data - even if someone gets access to a backup of the data they have to find a way to restore it on a comparable server before they can look at the data. Encrypted backups mitigate this issue even further. This is in contrast to the ease of getting access to data in an Excel spreadsheet - one must simply have a copy of Microsoft Excel on their machine to open a spreadsheet, and almost everyone has Excel.
Looking for an organization that can help you move your critical Excel spreadsheets to more stable and secure platform? Learn more about our data analytics services or contact us and let's start the conversation.
322 North Shore Drive
Building 1B, Suite 200
Pittsburgh, PA 15212
I am interested in information on: