Thursday, September 12, 2019

Big Data and data structures


There is a recent term in computer science, “Big Data,” which has a very loose definition and causes a lot of confusion within the industry. Big Data has been in existence since before computers existed. A perfect example of Big Data is ancient history that was recorded on scrolls. A scroll could only hold so much information before it was full and could hold no more. A single scroll was not too much to handle and carry around, but the amount of recorded data quickly expanded to hundreds and thousands of scrolls. This is what we refer to as Big Data, and we are still trying to come up with a solution to handle data that grows too large to be managed easily.

Most companies have an entire division dedicated to the management of data, and they have a big issue to face as we produce and gather more data in a single day than was collected and gathered for all of history prior to the computer age. Big Data is not a problem that will just go away and one of the ways we have begun to manage this landslide of data is to form tighter data structures. 

I know, I have now introduced another new term to talk about an old problem. Don’t worry, data structures are easy. Remember the scroll, it had a linear data structure, things were recorded on the scroll in the order that they happened and stored as characters of a written language. This is a very loose structure that is usually referred to as unstructured data, because you can write anything on a scroll. To have real data structure, you need a set format for recording the data. A great example of a data structure you have all seen is your federal income tax return form. They provide a set number of blocks to record your information on the form and reject the form if you go outside of the boundaries.  This is a data structure in paper format.

So how do data structures help to manage Big Data? The biggest way is by keeping the data in a known order, with a known size and known fields. For example, you might want to keep an address book; it would have all your friends’ names, addresses, phone numbers, and birthdays. What if you just started writing your friends’ information on a blank sheet of paper in a random order?

Bill, 9/1/73, 123 Main Street, Smith, MO, Licking, 65462, John Licking, Stevens, MO, 4/23/85, 573-414-5555, 65462, 573-341-5565, 123 Cedar Street. 

It would become quickly impossible to find anyone’s contact information in your address book, and even with the two friends in my example, you already have a Big Data problem; we don’t know what information belongs together.

If we take the same two people and provide a structure for the data, it suddenly becomes much more usable. 

Bill Smith, 123 Main Street, Licking, MO 65462, 537-414-5555, 9/1/73; John Stevens, 123 Cedar Street, Licking, MO 65462, 573-341-5565, 4/23/85. 

It is still not easily readable by a computer, because even though there is a known order, we have a field separator, the comma, but there is no known length, which complicates things for computer software. A computer likes to store data structures of a known length, so you need to define a size for each data field, and a character to represent empty space. In my example we will use 15 characters for every field and ^ will be an empty space.

Our address book now looks like this:

Bill Smith^^^^^
123 Main Street
Licking,_MO^^^^
65542^^^^^^^^^^
573-414-5555^^^
09/01/1973^^^^^
John Stevens^^^
123 Cedar Stree
Licking,_MO^^^^
65542^^^^^^^^^^
573-341-5565^^^
04/23/1985^^^^^


No comments: