Thursday, September 19, 2019

High Performance Computing


High Performance Computing (HPC) consists of two main types of computing platforms. The first is the shared memory platform, which runs a single operating system and acts as a single computer, where each processor in the system has access to all of the memory. The largest of these available on the market today are Atos’ BullSequana S1600 and HPE’s Superdome, which max out at 16 processor sockets and 24TB of memory. Coming out later this year, the BullSequana S3200 will supply 32 processor sockets and 48TB of memory.
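
To get a feel for the shared memory model, here is a minimal sketch in Python (the language, thread count, and array are illustrative choices of mine, not tied to any particular machine). Every thread sees, and can update, the very same block of memory:

import threading

# One array in memory, visible to every thread -- the shared memory model.
data = list(range(1_000_000))
totals = [0] * 4  # one result slot per thread, so no locking is needed

def partial_sum(thread_id, n_threads):
    # Each thread sums its own slice of the *same* array.
    totals[thread_id] = sum(data[thread_id::n_threads])

threads = [threading.Thread(target=partial_sum, args=(i, 4)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("total:", sum(totals))  # matches sum(data)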

The other type of HPC platform is called a distributed memory system; it links multiple computers together with a software stack that allows the separate systems to pass data from one to another using a message passing library. These systems first came about to replace expensive shared memory systems with commodity computer hardware, like your laptop or desktop computer. Standards for how to share memory through message passing were first developed about three decades ago and formed a new computing industry. These systems shifted from commodity hardware to specialized platforms about twenty years ago, with companies like Cray, Bull (now a part of Atos), and SGI leading the pack.
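
Here is an equally minimal sketch of message passing, this time using the mpi4py library for Python (my choice for illustration; any MPI implementation works the same way). Note that no memory is shared; everything travels as an explicit message:

from mpi4py import MPI  # Python bindings for the Message Passing Interface

comm = MPI.COMM_WORLD   # all of the processes launched together
rank = comm.Get_rank()  # this process's ID: 0, 1, 2, ...

if rank == 0:
    # Process 0 packages its data into a message and sends it to process 1.
    comm.send({"greeting": "hello from rank 0"}, dest=1, tag=0)
elif rank == 1:
    # Process 1 waits to receive it -- the two never touch each other's memory.
    msg = comm.recv(source=0, tag=0)
    print("rank 1 received:", msg)

Launched as two processes (for example, with mpiexec -n 2 python demo.py, where demo.py is whatever you name the file), each copy of the program runs in its own memory space and cooperates only through messages like this one.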

Today the main specialized hardware manufacturers are Atos, with their Direct Liquid Cooled Sequana XH2000, and HPE, with a conglomerate of technologies from its acquisitions of both SGI and Cray in the last few years. It is unclear in the industry which product lines will be kept through the mergers. HPC used to be purely a scientific research platform used by the likes of NASA and university research programs, but in recent years it has shifted to being used in nearly every industry, from movies, fashion, and toiletries to planes, trains, and automobiles.

The newest use cases for HPC today are in data analytics, machine learning, and artificial intelligence. However, I would say the leading use case for HPC worldwide is still in the fields of computational chemistry and computational fluid dynamics. Computational fluid dynamics studies how fluids or materials move or flow through mechanical systems. It is used to model things like how a detergent pours out of the detergent bottle, how a diaper absorbs liquids, and how air flows through a jet turbine for optimal performance. Computational chemistry uses computers to simulate molecular structures and chemical reaction processes, effectively replacing large chemical testing laboratories with large computing platforms.
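
To give a flavor of the kind of arithmetic these codes perform, here is a drastically simplified, one-dimensional toy of my own devising (real CFD solvers are vastly more sophisticated): a drop of dye spreading through a tube, computed one small time step at a time.

# Toy diffusion: a concentration spreading along a one-dimensional tube.
cells = [0.0] * 20
cells[10] = 1.0  # a drop of dye in the middle of the tube
rate = 0.25      # how quickly the dye diffuses per time step

for step in range(100):
    # Each cell drifts toward the average of its neighbors (finite differences).
    cells = [c + rate * (left + right - 2 * c)
             for left, c, right in zip([cells[0]] + cells[:-1], cells,
                                       cells[1:] + [cells[-1]])]

print([round(c, 3) for c in cells])  # the sharp spike has smeared out smoothly

A production solver does this same sort of neighbor-averaging over millions of cells in three dimensions, which is exactly why it needs a supercomputer.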

The two latest innovations causing a shift in HPC are cloud computing, offered through services like Google Cloud, Amazon Web Services, and Microsoft Azure, and quantum computing, the next generation of computers, which are still under development and not likely to be readily available for ten years or more.

If you are interested in learning more about HPC, there are a couple of great places to start. The University of Oklahoma hosts a conference that is free and open to the public every September. This year the event is being held on Sept. 24 and 25; more information about the event can be found at http://www.oscer.ou.edu/Symposium2019/agenda.html. There is also professional training available from the Linux Cluster Institute (http://www.linuxclustersinstitute.org). Scott Hamilton, Senior Expert in HPC at Atos, has also published a book on the Message Passing Interface designed for beginners in the field. It is available on Amazon through his author page (http://amazon.com/author/techshepherd). You can also find several free resources online by searching for HPC or MPI.

Thursday, September 12, 2019

Big Data and data structures


There is a recent term in computer science, “Big Data,” which has a very loose definition and causes a lot of confusion within the industry. Big Data has been in existence since before computers existed. A perfect example of Big Data is ancient history recorded on scrolls. A scroll could only hold so much information before it was full. A single scroll was not too much to handle and carry around, but the amount of recorded data quickly expanded to hundreds and thousands of scrolls. This is what we refer to as Big Data, and we are still trying to come up with a solution for handling data that grows too large to be managed easily.

Most companies have an entire division dedicated to the management of data, and they have a big issue to face: we now produce and gather more data in a single day than was collected in all of history prior to the computer age. Big Data is not a problem that will just go away, and one of the ways we have begun to manage this landslide of data is to form tighter data structures.

I know, I have now introduced another new term to talk about an old problem. Don’t worry, data structures are easy. Remember the scroll: it had a linear data structure. Things were recorded on the scroll in the order they happened and stored as characters of a written language. This is a very loose structure, usually referred to as unstructured data, because you can write anything on a scroll. To have real data structure, you need a set format for recording the data. A great example of a data structure you have all seen is your federal income tax return form. It provides a set number of blocks to record your information, and the form is rejected if you go outside the boundaries. This is a data structure in paper format.

So how do data structures help to manage Big Data? The biggest way is by keeping the data in a known order, with a known size and known fields. For example, you might want to keep an address book; it would have all your friends’ names, addresses, phone numbers, and birthdays. What if you just started writing your friends’ information on a blank sheet of paper in a random order?

Bill, 9/1/73, 123 Main Street, Smith, MO, Licking, 65462, John, Licking, Stevens, MO, 4/23/85, 573-414-5555, 65462, 573-341-5565, 123 Cedar Street.

It would quickly become impossible to find anyone’s contact information in your address book, and even with just the two friends in my example, you already have a Big Data problem; we don’t know what information belongs together.

If we take the same two people and provide a structure for the data, it suddenly becomes much more usable. 

Bill Smith, 123 Main Street, Licking, MO 65462, 573-414-5555, 9/1/73; John Stevens, 123 Cedar Street, Licking, MO 65462, 573-341-5565, 4/23/85.

It is still not easily readable by a computer: even though there is a known order and a field separator, the comma, there is no known length, which complicates things for computer software. A computer likes to store data structures of a known length, so you need to define a size for each data field and a character to represent empty space. In my example we will use 15 characters for every field, and ^ will represent an empty space.

Our address book now looks like this:

Bill Smith^^^^^
123 Main Street
Licking, MO^^^^
65462^^^^^^^^^^
573-414-5555^^^
09/01/1973^^^^^
John Stevens^^^
123 Cedar Stree
Licking, MO^^^^
65462^^^^^^^^^^
573-341-5565^^^
04/23/1985^^^^^
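
This is exactly how a computer stores such records. Notice that “123 Cedar Street” lost its final letter above: a 15-character field truncates anything longer. A minimal sketch in Python (the field names and function names are my own; the 15-character width comes from the example) shows both directions, packing a record into fixed-width fields and slicing it back apart:

WIDTH = 15
FIELDS = ["name", "street", "city_state", "zip", "phone", "birthday"]

def pack(record):
    # Pad each field to exactly 15 characters with ^, truncating if too long.
    return "".join(record[f][:WIDTH].ljust(WIDTH, "^") for f in FIELDS)

def unpack(text):
    # Because every field is exactly 15 characters, we slice by position
    # instead of searching for separators.
    return {f: text[i * WIDTH:(i + 1) * WIDTH].rstrip("^")
            for i, f in enumerate(FIELDS)}

bill = {"name": "Bill Smith", "street": "123 Main Street",
        "city_state": "Licking, MO", "zip": "65462",
        "phone": "573-414-5555", "birthday": "09/01/1973"}

packed = pack(bill)
print(packed)          # Bill Smith^^^^^123 Main Street...
print(unpack(packed))  # the original record, recovered by position alone

The payoff is in unpack: to find Bill’s phone number, the computer jumps straight to character 60 rather than reading everything that comes before it.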


Thursday, September 5, 2019

Blockchain outstanding questions


Last week we left a few unanswered questions while talking about blockchain technologies, and I would like to address them this week. The first one that comes to mind is the problem of double spending. Double spending can occur with a digital currency, or any digital payment processing system, unless all payments are authorized by a single, central authority. Blockchain does not have a single central authority, so in the early stages of development the main problem that had to be addressed was the possibility of the same currency being used in two transactions simultaneously, resulting in a double spend.

The double spend problem was solved by allowing only a single path in the chain and making each link depend on the hash code of the prior link. If two attempts to spend the same currency occurred at exactly the same time, only one transaction would be processed; the other would have an invalid hash code linking to the previous transaction and be ruled invalid. This solution raised the second question: “What is a hash?”

A hash is created by a computational function, called a hash function. A hash function maps data of any size to a fixed-size value. There are three basic rules for a hash function. The first is that each time you encode, or hash, the data, you get the same result. The second is that any small change, even a single character change in the data, must result in a different hash. The third is that a hash function cannot be reversed, meaning that you cannot use the hash to recreate the original data. Two pieces of data with a large difference can, however, result in the same hash; this is called a collision. An example of a trivial hash function is a function that maps names to a two-digit number: John Smith is 02, Lisa Smith is 01, Sam Doe is 04, and Sandra Dee is also 02. We won’t get into exactly how the mapping takes place, because it is a very advanced topic. All we need to know to understand how a hash function works is that it maps input data to a given set of specific values, like the example maps names to numbers between 00 and 99.
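
The three rules are easy to see in action. Here is a short sketch using Python’s standard hashlib library (SHA-256 is my own choice of hash function for the example; blockchains commonly use it, but any cryptographic hash behaves the same way):

import hashlib

def h(text):
    # Map data of any size to a fixed-size value (64 hexadecimal characters).
    return hashlib.sha256(text.encode()).hexdigest()

print(h("John Smith"))  # rule 1: the same input always gives the same hash
print(h("John Smith"))  # ...identical to the line above
print(h("John Smyth"))  # rule 2: one character changed, a completely new hash
# Rule 3: nothing printed here can be run backwards to recover "John Smith".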

We mentioned that the hash function is used to tie the links in the blockchain together. The links form a Merkle tree. In cryptography and computer science, a Merkle tree is a data structure that links data in a single direction, from leaf to parent. In a Merkle tree, each leaf node is labeled with the hash of the data it contains, and every parent node is labeled with the cryptographic hash of the labels of its child nodes. Merkle trees allow for efficient and secure verification of the contents of large data structures, like the transactional databases used in blockchains.
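
Here is a minimal sketch of building a tiny Merkle tree (a hand-rolled illustration of my own, assuming a power-of-two number of leaves, not a production implementation). Each parent’s label is the hash of its children’s labels, and the lone label at the top, the root, stands in for the entire data set:

import hashlib

def h(text):
    return hashlib.sha256(text.encode()).hexdigest()

# Leaf nodes: each is labeled with the hash of the data it contains.
transactions = ["pay Bill 5", "pay John 3", "pay Lisa 7", "pay Sam 2"]
level = [h(t) for t in transactions]

# Hash pairs of child labels together until a single root label remains.
while len(level) > 1:
    level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

print("Merkle root:", level[0])  # changing any transaction changes this value

Verifying a single transaction requires only the handful of hashes along its path to the root, not the whole data set, which is what makes the verification efficient.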

Merkle trees are named for Ralph Merkle, who patented the technology in 1979. They are used to verify any stored data that is handled and transferred in and between computers, ensuring that data blocks are not changed by other peers in a peer-to-peer network, whether through accidental corruption or fake blocks created by malicious systems. This makes it difficult, but not impossible, to introduce fake data into a blockchain, as doing so requires creating a data block that matches the hash of the block you are replacing, in effect corrupting the tree. However, generating the fake data block is a time-consuming process and unlikely to be completed before the next real block is generated, making it practically impossible to inject your change.
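
To see why, here is a minimal sketch of the linking described above (a toy chain of my own; real blockchains add timestamps, proof of work, and much more). Each block carries the hash of the block before it, so altering any block breaks every link after it:

import hashlib

def h(text):
    return hashlib.sha256(text.encode()).hexdigest()

# Each block records its data plus the hash of the previous block.
chain = [{"data": "genesis", "prev": "0" * 64}]
for data in ["Bill pays John 5", "John pays Lisa 3"]:
    prev_hash = h(chain[-1]["data"] + chain[-1]["prev"])
    chain.append({"data": data, "prev": prev_hash})

def valid(chain):
    # Recompute every link; one altered block invalidates all that follow it.
    return all(block["prev"] == h(prev["data"] + prev["prev"])
               for prev, block in zip(chain, chain[1:]))

print(valid(chain))                      # True
chain[1]["data"] = "Bill pays John 500"  # attempt to rewrite a transaction
print(valid(chain))                      # False: the hashes no longer line up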

Join me again next week for an overview of data structures and their applications.