The trick to managing data is to manage out the complexity and manage in the innovation. That is easier said than done, so I am going to break down some of the ways we can do it, and I am fairly sure you have not seen this in the mainstream blogs. Let’s get down to it.
Unstructured data is usually described as data of all sorts that is not really set up to go into a database. However, like most data, if you go a level or so deeper some clarity appears. When you break unstructured data into repetitive and non-repetitive data, usable trends begin to show. Repetitive data has no standard database structure, but you know each record will be the same sort of thing. For instance, smart meter data comes back repetitively and is unstructured, but we can find ways to manage it. Non-repetitive unstructured data is more like emails, documents, Twitter feeds and call center data. Call center data becomes useful when you combine it with online web data to understand potential patterns in behavior. We have pattern matched this for banks and retail stores to understand when returns or contract cancellations might occur. However, emails and social media are much more contextual. You really need to parse that data (regular expressions are a good place to start) to understand the context. Someone could say “Great!” and be positive, or be sarcastic. How do you break that down? Often you need to infer the meaning not from a single word but from the whole sentence, or even from that user’s pattern of behavior.
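To make the split concrete, here is a minimal sketch of regular expressions as a starting point. The meter line format, the field names and the short word list are all made up for illustration; a real feed and a real vocabulary will look different.

```python
import re

# Hypothetical line format for a smart meter feed (an assumption, not a standard).
METER_LINE = re.compile(
    r"meter=(?P<meter>\w+)\s+ts=(?P<ts>\S+)\s+kwh=(?P<kwh>\d+(?:\.\d+)?)"
)

def parse_meter_line(line):
    """Repetitive data: every record has the same shape, so one pattern covers it."""
    match = METER_LINE.search(line)
    if match is None:
        return None
    return {
        "meter": match.group("meter"),
        "ts": match.group("ts"),
        "kwh": float(match.group("kwh")),
    }

# Illustrative word list only; a real one would be far larger.
PRAISE = re.compile(r"\b(great|awesome|love)\b", re.IGNORECASE)

def flag_praise_words(text):
    """Non-repetitive data: a regex can find the word, not the intent behind it."""
    return [word.lower() for word in PRAISE.findall(text)]
```

Running flag_praise_words on “Great! Third outage this week.” finds the word “great” but cannot say whether it is sincere or sarcastic. That gap is why the sentence, and the user’s history, have to come into the picture.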
This is where the complexity of analytics really begins to take shape, and where your internal process and design need to work out how to break social media down into blocks that are understandable and also give analytic value. Simply storing this data does not give you the ability to aggregate it or generate intelligence from it unless you spend significant time parsing and understanding that information.
So, fundamentally, your architecture needs to be able to do simple analytics at scale (repetitive) and complex analytics at scale (non-repetitive), and likely combine this complex set with other orthogonal data. You can think of orthogonal data as added context data that does not usually relate directly to the dataset in question. If, for instance, you are looking at cars that run red lights, you have a data set that is quantitative, but if you want to add context, look at the time of each violation alongside, say, sunrise and sunset. With the additional data inputs, you may then see that more red lights are run around sunset, suggesting that the driver may not have been able to see the light at that time. Orthogonal data sets can often add context where none could otherwise be found.
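The red light example can be sketched in a few lines. The violation timestamps, the sunset times and the 30 minute window below are all invented for illustration; in practice the sunset table would come from an external source for your location.

```python
from datetime import date, datetime, timedelta

# Made-up violation timestamps (the primary, quantitative data set).
violations = [
    datetime(2024, 3, 1, 8, 15),
    datetime(2024, 3, 1, 17, 50),
    datetime(2024, 3, 1, 18, 5),
    datetime(2024, 3, 2, 12, 30),
    datetime(2024, 3, 2, 17, 58),
]

# Made-up sunset times (the orthogonal data set), keyed by calendar date.
sunsets = {
    date(2024, 3, 1): datetime(2024, 3, 1, 17, 55),
    date(2024, 3, 2): datetime(2024, 3, 2, 17, 56),
}

def near_sunset(ts, sunsets, window):
    """True when the timestamp falls within `window` of that day's sunset."""
    sunset = sunsets.get(ts.date())
    return sunset is not None and abs(ts - sunset) <= window

def split_by_sunset(violations, sunsets, window=timedelta(minutes=30)):
    """Return (count near sunset, count at any other time)."""
    near = sum(1 for ts in violations if near_sunset(ts, sunsets, window))
    return near, len(violations) - near

print(split_by_sunset(violations, sunsets))  # (3, 2)
```

A raw count like this is only a hint. Before drawing a conclusion you would want to compare it against how much traffic passes through the intersection at each time of day.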
Non-repetitive data is where you will spend most of your time looking for understanding and figuring out ways to apply logic and taxonomies to add the context you need. Taxonomies work nicely for a contextual sort when a physical one cannot be done. Take a sentence such as, “His rage from being a Taurus meant he crashed his Ford Taurus into the wall”: you can use a taxonomy of horoscope signs to make an inference, or a taxonomy of car models. Which one you use leads to different data actions and inferences. Although I am highlighting a lot of complexity, I am also pointing out that it sits in some places and not in others. How you design your architecture depends on the sort of data you have; next you need to acquire the tools to address scale, complexity, and simplicity.
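As a rough sketch of the Taurus sentence, the same text can be tagged against two tiny taxonomies. The taxonomy names and their entries are placeholders chosen for this example, not a real reference vocabulary.

```python
import re

# Two toy taxonomies (illustrative entries only).
TAXONOMIES = {
    "horoscope": ["aries", "taurus", "gemini"],
    "car_model": ["ford taurus", "honda civic"],
}

def tag(text, taxonomy_name):
    """Return the taxonomy terms found in the text, in order of appearance."""
    lowered = text.lower()
    hits = []
    for term in TAXONOMIES[taxonomy_name]:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]

sentence = "His rage from being a Taurus meant he crashed his Ford Taurus into the wall"
print(tag(sentence, "horoscope"))  # ['taurus', 'taurus']
print(tag(sentence, "car_model"))  # ['ford taurus']
```

The horoscope taxonomy on its own matches both mentions, including the one that is really a car. Choosing the taxonomy, or layering several and preferring the longest match, is where the business context comes in.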
Medical records have come a long way, but notes are still somewhat difficult to interpret. One way to approach them is to understand the roles that are in play. In a medical facility you will have facility coordinators, nurses, orderlies, doctors and often research faculty (if you have an academic medical facility). All of them will use a hospital-wide system (often Epic) to input patient and facility information. When storing notes in a system you can store a picture of the note, but that does not give you information unless you access it directly. Assume that in the notes someone has written “Na”. How would you go about interpreting that? Well, we talked about taxonomy, so start there, but then look at the roles that are in play. A facility coordinator may put a note in there with “Na” about beds not being available. A nurse may put that in as meaning a certain medication is not available. A medical research assistant may be looking at the periodic table and referencing sodium (Na) as part of a clinical trial. So taxonomy and roles can help you rule out incorrect interpretations and get closer to what you need.
There are many other ways to interpret and categorize data, but what I hope to show you is that you will need to get in there and apply that business context yourself, because refining patterns for your customers will be similar to the examples I have used. The value of your data is essentially hidden in its ambiguity, and you will need a solid approach to move it to intelligence. In coming articles I will cover the approach we use to handle extremely complex data without bringing your existing architecture to a halt. While most data experts cover regular architectures, few get into the details of frameworks that support agile architectures for next-generation operating platforms.
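Going back to the “Na” note, here is a minimal sketch of combining a token with the author’s role. The role names, the lookup table and the “unresolved” fallback are assumptions for illustration; they are not how Epic or any other system stores this.

```python
# Hypothetical (role, token) lookup built from the three readings of "Na".
ROLE_MEANINGS = {
    ("facility_coordinator", "na"): "bed not available",
    ("nurse", "na"): "medication not available",
    ("research_assistant", "na"): "sodium (Na)",
}

def interpret(token, role):
    """Resolve a note token using the author's role; fall back to 'unresolved'."""
    key = (role, token.strip().lower())
    return ROLE_MEANINGS.get(key, "unresolved")

print(interpret("Na", "nurse"))               # medication not available
print(interpret("Na", "research_assistant"))  # sodium (Na)
print(interpret("Na", "orderly"))             # unresolved
```

Anything that comes back as unresolved is a candidate for a human to review, or for a new entry in the table once the meaning is known.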
Since we are all going to have to do this at some point, we are going to release the framework and the recommended architectures on my blog. Get ready to get into the complexity of semantic understanding to take your data analytics to the next level.