I’m conscious that the IT world is moving in the direction of “Serverless” code, where business logic is loaded to a service and the infrastructure underneath abstracted away. In that way, it can be woken up from dormant and scaled up and down automatically, in line with the size of the workload being put on it. Until then, I wanted (between interim work assignments) to set up a home project to implement a business idea I had some time back.
I looked for the simplest way to run up a service on all the main cloud vendors. After half a day of research, I elected to try Django on Digital Ocean, where a “one click install” was available. This looked the simplest way to install Django on any of the major cloud vendors. It took 30 minutes end to end to run the instance up, ready to go; that was until I realised it was running an old version of Django (1.8), and used Python 2.7, which is not supported by the (then) soon to be released 2.0 version of Django. So, off I went trying to build everything from the ground up.
The main requirement was that I was developing on my Mac, but the production version would run in the cloud on a Linux instance, so I had to set up both. I elected to use PostgreSQL as the database, Nginx with Gunicorn as the web server stack, Let’s Encrypt (as recommended by the EFF) for certificates and Django 1.11, the latest version when I set off. The local development environment used Microsoft Visual Studio Code alongside GitHub.
One of the nuances of Django is that users are normally expected to log in with a username different from their email address. I really wanted my app to use a person’s email address as their only login username, so I had to put customisations into the Django set-up to achieve that along the way.
A further challenge is that target devices used by customers are heavily weighted to mobile phones on other sites I run, so I elected to use Google’s Material user interface guidelines. The Django implementation is built on an excellent framework I’ve used in another project, MaterializeCSS (as built by four Stanford graduates), and supplemented by a lot of custom work on template tags, forms and layout directives by Mikhail Podgurskiy in a package called django-material (see: http://forms.viewflow.io/).
The mission was to get all the above running before I could start adding my own authentication and application code. The end result is an application that will work nicely on phones, tablets or PCs, resizing automatically as needed.
It turned out to be a major piece of work just getting the basic platform up and running, so I noted all the steps I took (as I went along) just in case this helps anyone (or the future me!) looking to do the same thing. If it would help you (it’s long), just email me at [email protected]. I’ve submitted it back to Digital Ocean, but happy to share the step by step recipe.
Alternatively, hire me to do it for you!
Best Read of the Year, not just for high technology, but for a reasoned meaning behind political events over the last two years, both in the UK and the USA. I can relate it straight back to some of the prescient statements made by Jeff Bezos about Amazon “Day 1” disciplines: the best defence against an organisation’s path to oblivion being:
- customer obsession
- a skeptical view of proxies
- the eager adoption of external trends, and
- high-velocity decision making
Things go off course when interests divide in a zero-sum way between different customer groups that you serve, and where proxies indicating “success” diverge from a clearly defined “desired outcome”.
The normal path is to start with your “customer” and give an analogue of what indicates “success” for them in what you do; a clear understanding of the desired outcome. Then the measures to track progress toward that goal, the path you follow to get there (adjusting as you go), and a frequent review that steps still serve the intended objective.
Fake News on Social Media, Finance Industry Meltdowns, unfettered slavery to “the market” and to “shareholder value” have all been central to recent political events in both the UK and the USA. Politicians of all colours were complicit in letting proxies for “success” dissociate fair balance of both wealth and future prospects from a vast majority of the customers they were elected to serve. In the face of that, the electorate in the UK bit back – as they did for Trump in the US too.
Part 3 of the book, entitled “A World Ruled by Algorithms” – pages 153-252 – is brilliant writing on our current state and injustices. Part 4 (pages 255-350) entitled “It’s up to us” maps a path to brighter times for us and our descendants.
The barriers to fresh thinking are even higher in politics than in business. The Overton Window, a term introduced by Joseph P. Overton of the Mackinac Center for Public Policy, says that an idea’s political viability falls within a window framing a range of policies considered politically acceptable in the current climate of public opinion. There are ideas that a politician simply cannot recommend without being considered too extreme to gain or keep public office.
In the 2016 US presidential election, Donald Trump didn’t just push the Overton Window far to the right, he shattered it, making statement after statement that would have been disqualifying for any previous candidate. Fortunately, once the window has come unstuck, it is possible to move it in radically new directions.
He then says that when such things happen, as they did at the time of the Great Depression, the scene is set to do radical things to change course for the ultimate greater good. So, things may well get better the other side of Trump’s outrageous pandering to the excesses of the right, and indeed after we see the result of our electorate’s division over Brexit played out in the next 18 months.
One final thing that struck me was how one political “hot potato” issue involving Uber in Taiwan got very divided and extreme opinions split 50/50 – but nevertheless got reconciled to everyone’s satisfaction in the end. This was done using a technique called Principal Component Analysis (PCA) and a piece of software called “Pol.is”. This allows folks to publish assertions, vote and see how the filter bubbles evolve through many iterations over a 4-week period. “I think Passenger Liability Insurance should be mandatory for riders on UberX private vehicles” (heavy split votes, 33% both ends of the spectrum) evolved to 95% agreeing with “The Government should leverage this opportunity to challenge the taxi industry to improve their management and quality control system, so that drivers and riders would enjoy the same quality service as Uber”. The licensing authority in Taipei duly followed up for the citizens and all sides of that industry.
I wonder what the Brexit “demand on parliament” would have looked like if we’d followed that process, and if indeed any of our politicians could have encapsulated the benefits to us all on either side of that question. I suspect we’d have a much clearer picture than we do right now.
In summary, a superb book. Highly recommended.
Your DNA is a string of base pairs that encapsulates your “build” instructions, as inherited from your birth parents. While copies of it are packed tightly into every cell in, and being given off, your body, it is of considerable size; a machine representation of it is some 2.6GB in length – which would fit on a single DVD.
The total entity – the human genome – is a string of C-G and A-T base pairs. The exact “reference” structure, given the way in which strands are structured and subsections decoded, was first successfully concluded in 2003. Its accuracy has improved steadily as more DNA samples have been analysed down the years since.
A sequencing machine will typically read short lengths of DNA chopped up into pieces (in a random pile, like separate pieces of a jigsaw), and by comparison against a known reference genome, gradually piece together which bit fits where; there are known ‘start’ and ‘end’ segment patterns along the way. To add a bit of complexity, the chopped read may get scanned backwards, so it takes a lot of compute effort to piece a DNA sample into what it would look like if we were able to read it uninterrupted from beginning to end.
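To make the jigsaw idea concrete, here is a minimal sketch of placing short reads against a reference. It assumes that a read "scanned backwards" shows up as the reverse complement of the reference (my reading of the paragraph above), uses exact matching only, and works on a made-up 14-base reference; real aligners use indexed, error-tolerant matching, so treat the function names and data as purely illustrative.

```python
# Toy read placement against a reference sequence (illustrative only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence as it would read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def place_read(reference, read):
    """Return (offset, strand) for the first exact match, or None.

    strand is "+" when the read matches as given and "-" when only its
    reverse complement matches, i.e. the fragment was read backwards.
    """
    offset = reference.find(read)
    if offset != -1:
        return offset, "+"
    offset = reference.find(reverse_complement(read))
    if offset != -1:
        return offset, "-"
    return None


# Hypothetical 14-base reference and three short reads (assumed data).
reference = "ACGTTGCAAGGCTA"
reads = ["TGCAAG", "TAGCCT", "GGGGGG"]
placements = {read: place_read(reference, read) for read in reads}
print(placements)
```

The second read only fits once it is flipped, and the third fits nowhere, which is the sort of leftover a real pipeline has to handle with fuzzier matching.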
At the time of writing (July 2017), we’re up to version 38 of the reference human genome. 23andMe currently use version 37 for their data to surface inherited medical traits. Most of the DNA sampling industry trace family history reliably using version 36, and hence most exports to work with common DNA databases automatically “downgrade” to that version for best consistency.
DNA has 46 sections (known as Chromosomes); 23 of them come from your birth father, 23 from your birth mother. While all humans have over 99% commonality, the remaining difference of less than 1% makes every one of us (or a pair of identical twins) statistically unique.
The cost to sample your own DNA – or that of a relative – is these days in the range of £79-£149. The primary one looking for inherited medical traits is 23andMe. The biggest volume for family tree use is AncestryDNA. That said, there are other vendors such as Family Tree DNA (FTDNA) and MyHeritage that also offer low cost testing kits.
The AncestryDNA database has some 4 million DNA samples to match against, 23andMe over 1 million. The one annoyance is that you can’t export your own data from one of these two and then insert it in the other for matching purposes (neither has import capabilities). However, all the major vendors do allow exports, so you can upload your data from AncestryDNA or 23andMe into FTDNA, MyHeritage and to the industry leading cross-platform GEDmatch DNA databases very simply.
Exports create a ZIP file. With FTDNA, MyHeritage and GEDmatch, you request an import, and these prompt for the name of that ZIP file itself; you have no need to break it open first at all.
On receipt of the testing kit, register the code on the provided sample bottle on their website. Just avoid eating/drinking for 30 minutes, spit into the provided tube up to the level mark, seal, put back in the box, seal it and pop it in a postbox. Results will follow in your account on their website in 2-4 weeks.
Family Tree matching
Once you receive your results, Ancestry and 23andMe will give you details of any suggested family matches on their own databases. The primary warning here is that matches will be against your birth mother and whoever made her pregnant; given historical unavailability of effective birth control mechanisms and the secrecy of adoption processes, this has been known to surface unexpected home truths. Relatives trace up and down the family tree from those two reference points. A quick gander of self help forums on social media can be entertaining, or a litany of horror stories – alongside others of raw delight. Take care, be ready for the unexpected:
My first social media experience was seeing someone confirm a doctor as her birth father. Her introductory note to him said that he may remember her Mum, as she used to be his nursing assistant.
Another was a man who, once identified, admitted to knowing her birth mother in his earlier years – but said it couldn’t be him “as he’d never make love with someone that ugly”.
Outside of those, there are fairly frequent outright denials questioning the reliability of the science behind DNA testing, none of which stand up to any factual scrutiny. But among the horror stories, there are also tales of delight in all parties when long lost, separated or adopted kids locate, and successfully reconnect with, one or both birth parents and their families.
Loading into other databases, such as GEDmatch
In order to escape the confines of vendor specific DNA databases, you can export data from almost any of the common DNA databases and reload the resulting ZIP file into GEDmatch. Once imported, there’s quite a range of analysis tools sitting behind a fairly clunky user interface.
The key discovery tool is the “one to many” listing, which does a comparison of your DNA against everyone else’s in the GEDmatch database – and lists partial matches in order of closeness to your own data. It does this using a unit of measure called “centiMorgans”, abbreviated “cM”. Segments that show long exact matches are totted up, giving a total proportion of DNA you share. If you matched yourself or an identical twin, you’d match a total of circa 6800cM. Half your DNA comes from each birth parent, so they’d show as circa 3400cM. From your grandparents, half again. As your family tree extends both upwards and sideways (to uncles, aunts, cousins, their kids, etc), the numbers will increasingly dilute by half each step; you’ll likely be in the thousands of potential matches 4 or 5 steps away from your own data:
If you want to surface birth parent, child, sibling, half sibling, uncle, aunt, niece, nephew, grandparent and grandchild relationships reliably, then only matches of greater than 1300cM are likely to have statistical significance. Any lower than that is an increasingly difficult struggle to fill out a family tree, usually pursued by asking other family members to get their DNA tested; it is fairly common for GEDmatch to give you details (including email addresses) of your 1,000-2,000 closest matches, albeit sorted in descending ‘close-ness’ order for you.
As one example from GEDmatch, the highlighted line shows a match against one of the subject’s parents (their screen name and email address cropped off this picture):
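The halving rule and the 1300cM cut-off can be sketched as a few lines of arithmetic. This is only a rough model under the assumptions stated in the text (circa 6800cM for a self match, exactly half lost per step); real shared totals vary around these averages, and the function names are mine.

```python
# Rough model of shared centiMorgans, using the approximate figures quoted above.
SELF_MATCH_CM = 6800              # self or identical twin (approximate)
SIGNIFICANCE_THRESHOLD_CM = 1300  # cut-off suggested in the text


def expected_shared_cm(steps):
    """Expected shared cM after halving once per step away from yourself."""
    if steps < 0:
        raise ValueError("steps must be zero or more")
    return SELF_MATCH_CM / (2 ** steps)


def is_reliable_match(shared_cm):
    """True when a match clears the 1300cM significance cut-off."""
    return shared_cm > SIGNIFICANCE_THRESHOLD_CM


for steps in range(6):
    shared = expected_shared_cm(steps)
    verdict = "reliable" if is_reliable_match(shared) else "needs more evidence"
    print(f"{steps} step(s) away: about {shared:.0f}cM ({verdict})")
```

Under this simple model two steps away (about 1700cM) clears the cut-off while three steps (about 850cM) does not, which is why the threshold separates close family from the long tail of distant cousins.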
There are more advanced techniques to use a Chromosome browser to pinpoint whether a match comes down a male line or not (to help understand which side of the tree relationships a match is more likely to reside on), but these are currently outside my own knowledge (and current personal need).
Future – take care
One of the central tenets of the insurance industry is to scale societal costs equitably across a large base of folks who may, at random, have to take benefits from the funding pool. To specifically not prejudice anyone whose DNA may give indications of inherited risks or pre-conditions that may otherwise jeopardise their inclusion in cost effective health insurance or medical coverage.
Current UK law specifically makes it illegal for any commercial company or healthcare enterprise to solicit data, including DNA samples, where such provision may prejudice the financial cost, or service provision, to the owner of that data. Hence, please exercise due care with your DNA data, and with any entity that can associate that data with you as a uniquely identifiable individual. Wherever possible, only have that data stored in locations in which local laws, and the organisations holding your data, carry due weight or agreed safe harbour provisions.
Country/Federal Law Enforcement DNA records.
The largest DNA databases in many countries are held, and administered, for police and criminal justice use. A combination of crime scene samples, DNA of known convicted individuals, as well as samples to help locate missing people. The big issue at the time of writing is that there’s no ability to volunteer any submission for matching against missing person or police held samples, even though those data sets are fairly huge.
Access to such data assets is jealously guarded, and there is no current capability to volunteer your own readings for potential matches to be exposed to any case officer; intervention is at the discretion of the police, and they usually do their own custom sampling process and custom lab work. Personally, I think this is a great shame, particularly for individuals searching for a missing relative and seeking to help enquiries should their data help identify a match at some stage.
I’d personally gladly volunteer if there were appropriate safeguards to keep my physical identity well away from any third party organisation; only to bring the match to the attention of a case officer, and to leave any feedback to interested relatives only at their professional discretion.
I’d propose that any matches over 1300 cM (CentiMorgans) get fed back to both parties where possible, or at least allow cases to get closed. That would surface birth parent, child, sibling, half sibling, uncle, aunt, niece, nephew, grandparent and grandchild relationships reliably.
At the moment, police typically won’t take volunteer samples unless a missing person is vulnerable. Unfortunately not yet for tracing purposes.
Come join in – £99 is all you need to start
Whether for medical traits knowledge, or to help round out your family trees, now is a good time to get involved cost effectively. Ancestry currently add £20 postage to their £79 testing kit, hence £99 total. 23andMe do ancestry matching, Ethnicity and medical analyses too for £149 or so all in. However, Superdrug are currently selling their remaining stock of 23andMe testing kits (bought when the US dollar rate was better than it now is) for £99. So – quick, before stock runs out!
Either will permit you to load the raw data, once analysed, onto FTDNA, MyHeritage and GEDmatch when done too.
Never a better time to join in.
One of the early lessons you pick up looking at product lifecycles is that some people hold out buying any new technology product or service longer than anyone else. You make it past the techies, the visionaries, the early majority, late majority and finally meet the laggards at the very right of the diagram (PDF version here). The normal way of selling at that end of the bell curve is to embed your product in something else; the person who swore they’d never buy a Microprocessor unknowingly has one inside the controls on their Microwave, or 50-100 ticking away in their car.
In 2016, Google started releasing access to its Vision API. They were routinely using their own Neural networks for several years; one typical application was taking the video footage from their Google Maps Streetview cars, and correlating house numbers from video footage onto GPS locations within each street. They even started to train their own models to pick out objects in photographs, and to be able to annotate a picture with a description of its contents – without any human interaction. They have also begun an effort to do likewise describing the story contained in hundreds of thousands of YouTube videos.
One example was to ask it to differentiate muffins and dogs:
This it does with aplomb, usually with much better than human performance. So, what’s next?
One notable time in Natural History was the explosion in the number of species on earth that occurred in the Cambrian period, some 534 million years ago. This was the time when it appears life forms first developed useful eyes, which led to an arms race between predators and prey. Eyes everywhere, and brains very sensitive to signals that come that way; if something or someone looks like they’re staring at you, sirens in your consciousness will be at full volume.
Once a neural network is taught (you show it 1000s of images, and tell it which contain what, then it works out a model to fit), the resulting learning can be loaded down into a small device. It usually then needs no further training or connection to a bigger computer nor cloud service. It can just sit there, and report back what it sees, when it sees it; the target of the message can be a person or a computer program anywhere else.
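As a toy illustration of that split between teaching and deployment, the sketch below trains a single-neuron perceptron (far simpler than an image network) on four made-up labelled examples, exports the learned numbers as JSON, and then runs inference from the exported file alone. The data, function names and JSON layout are all assumptions for illustration, not any vendor's model format.

```python
import json


def train_perceptron(samples, epochs=20):
    """Learn weights and a bias so that w.x + b > 0 only for label 1."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            error = label - (1 if activation > 0 else 0)
            # Nudge the model towards the correct answer when it was wrong.
            weights = [w + error * x for w, x in zip(weights, features)]
            bias += error
    return {"weights": weights, "bias": bias}


def predict(exported_model, features):
    """Device-side inference: needs only the exported numbers, no training."""
    model = json.loads(exported_model)
    activation = sum(w * x for w, x in zip(model["weights"], features))
    return 1 if activation + model["bias"] > 0 else 0


# "Cloud" side: show labelled examples (hypothetical: both signals present = 1).
training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
exported = json.dumps(train_perceptron(training_data))

# "Device" side: load the small exported model and report what it sees.
for features, _ in training_data:
    print(features, "->", predict(exported, features))
```

The exported model here is a handful of numbers; the same principle is what lets a much larger trained network sit on a phone or camera without a link back to the machine that trained it.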
While Google have been doing the heavy lifting on building the learning models in the cloud, Apple have slipped in with their own Core ML data format, a sort of PDF for the resulting machine learning models. They then use the Graphics Processing Units on their iPhone and iPad devices to run the resulting models on the user’s device. They also have their ARKit libraries (as in “Augmented Reality”) to sense surfaces and boundaries live on the embedded camera – and to superimpose objects in the field of view.
With iOS 11 coming in the autumn, any handwritten notes get automatically OCR’d and indexed – and added to local search. When a document on your desk is photo’d from an angle, it can automatically flatten it to look like a hi res scan of the original – and which you can then annotate. There are probably many like features which will be in place by the time the new iPhone models arrive in September/October.
However, tip of the iceberg. When I drive out of the car park in the local shopping centre here, the barrier automatically raises given the person with the ticket issued to my car number plate has already paid. And I guess we’re going to see a Cambrian explosion as inexpensive “eyes” get embedded in everything around us in our service.
With that, one example of what Amazon are experimenting with in their “Amazon Go” shop in Seattle. Every visitor a shoplifter: https://youtu.be/NrmMk1Myrxc
Lots more to follow.
PS: as a footnote, an example drawing a ruler on a real object. This is 3 weeks after ARKit got released. Next: personalised shoe and clothes measurements, and mail order supply to size: http://www.madewitharkit.com/post/162250399073/another-ar-measurement-app-demo-this-time
One thing that bemused the hell out of me – as a Software guy visiting prospective PC dealers in 1983 – was our account manager for the North UK. On arrival at a new prospective reseller, he would take a tape measure out, and measure the distance between the nearest Directors Car Parking Slot, and their front door. He’d then repeat the exercise for the nearest Visitors Car Parking Spot and the front door. And then walk in for the meeting to discuss their application to resell our range of Personal Computers.
If the Directors’ slot was closer to the door than the Visitors’ slot, the meeting was a very short one. The positioning betrayed the senior management’s attitude to customers, which in countless cases in other regions I saw (eventually) translate to that company’s success (or otherwise). A brilliant and simple leading indicator.
One of the other red flags when companies became successful was when their own HQ building became ostentatious. I always wonder if the leaders can manage to retain their focus on their customers at the same time as building these things. Like Apple in a magazine today:
And then Salesforce, with the now tallest building in San Francisco:
I do sincerely hope the focus on customers remains in place, and that none of the customers are adversely upset with where each company is channeling its profits. I also remember a Telco Equipment salesperson turning up at his largest customer in his new Ferrari, and their reaction of disgust that unhinged their long term relationship; he should have left it at home and driven in using something more routine.
Modesty and Frugality are usually a better leading indicator of delivering good value to folks buying from you. As are all the little things that demonstrate that the success of the customer is your primary motivation.
My perception is as follows. I’m also happy to be told I’m mad, or delusional, or both – but here goes. Most reflect changes well past the industry move from CapEx led investments to Opex subscriptions of several years past, and indeed the wholesale growth in use of Open Source Software across the industry over the last 10 years. Your own Mileage, or that of your Organisation, May Vary:
- If anyone says the words “private cloud”, run for the hills. Or make them watch https://youtu.be/URvWSsAgtJE. There is also an equivalent showing how to build a toaster for $15,000. Being in the business of building your own datacentre infrastructure is now an economic fallacy. My last month’s Amazon AWS bill (where I’ve been developing code – and have a one page site saying what the result will look like) was for 3p. My Digital Ocean server instance (that runs a network of WordPress sites) with 30GB flash storage and more bandwidth than I can shake a stick at, plus backups, is $24/month. Apart from that, all I have is subscriptions to Microsoft, Github and Google for various point services.
- Most large IT vendors have approached cloud vendors as “sell to”, and sacrificed their own future by not mapping customer landscapes properly. That’s why OpenStack is painting itself into a small corner of the future market – aimed at enterprises that run their own data centres and pay support costs on a per software instance basis. That’s Banking, Finance and Telco land. Everyone else is on (or headed to) the public cloud, for both economic reasons and “where the experts to manage infrastructure and its security live” at scale.
- The War stage of Infrastructure cloud is over. Network effects are consolidating around a small number of large players (AWS, Google Cloud Platform, Microsoft Azure) and more niche players with scale (Digital Ocean among SME developers, Softlayer in IBM customers of old, Heroku with Salesforce, probably a few hosting providers).
- There is a lot of focus on using Containers as a delivery mechanism for scale out infrastructure, and management tools to orchestrate their environment. Go, Chef, Jenkins, Kubernetes, none of which I have operational experience with (as I’m building new apps that have fewer dependencies on legacy code and data than most). Continuous Integration and DevOps are often cited in environments where custom code needs to be deployed, with Slack as the ultimate communications tool to warn of regular incoming updates. Having been at one startup for a while, it often reminded me of the sort of military infantry call of “incoming!” from the DevOps team.
- There are some laudable efforts to abstract code to be able to run on multiple cloud providers. FOG in the Ruby ecosystem. CloudFoundry (termed BlueMix in IBM) is executing particularly well in large Enterprises with investments in Java code. Amazon are trying pretty hard to make their partners use functionality only available on AWS, in traditional lock-in strategy (to avoid their services becoming a price led commodity).
- The bleeding edge is currently “Function as a Service”, “Backend as a Service” or “Serverless apps” typified with Amazon Lambda. There are actually two different entities in the mix; one to provide code and to pay per invocation against external events, the other to be able to scale (or contract) a service in real time as demand flexes. You abstract all knowledge of the environment away.
- Google, Azure and to a lesser extent AWS are packaging up API calls for various core services and machine learning facilities. Eg: I can call Google’s Vision API with a JPEG image file, and it can give me the location of every face (top of nose) on the picture, face bounds, and whether each is smiling or not. Another that can describe what’s in the picture. There’s also a link into machine learning training to say “does this picture show a cookie” or “extract the invoice number off this image of a picture of an invoice”. There is an excellent 35 minute discussion on the evolving API landscape (including the 8 stages of API lifecycle, the need for honeypots to offset an emergent security threat and an insight to one impressive Uber API) on a recent edition of the Google Cloud Platform Podcast: see http://feedproxy.google.com/~r/GcpPodcast/~3/LiXCEub0LFo/
- Microsoft and Google (with PowerApps and App Maker respectively) are trying to remove the queue of IT requests for small custom business apps based on company data. Though so far, only on internal intranet type apps, not exposed outside the organisation. This is also an antithesis of the desire for “big data”, which is really the domain of folks with massive data sets and the emergent “Internet of Things” sensor networks – where cloud vendor efforts on machine learning APIs can provide real business value. But for a lot of commercial organisations, getting data consolidated into a “single version of the truth” and accessible to the folks who need it day to day is where PowerApps and App Maker can really help.
- Mobile apps are currently dogged by “winner take all” app stores, with a typical user using 5 apps for almost all of their mobile activity. With new enhancements added by all the major browser manufacturers, web components will finally come to the fore for mobile app delivery (not least as they have all the benefits of the web and all of those of mobile apps – off a single code base). Look to hear a lot more about Polymer in the coming months (which I’m using for my own app in conjunction with Google Firebase – to develop a compelling Progressive Web app). For an introduction, see: https://www.youtube.com/watch?v=VBbejeKHrjg
- Overall, the thing most large vendors and SIs have missed is to map their customer needs against available project components. To map user needs against axes of product life cycle and value chains – and to suss the likely movement of components (which also tells you where to apply six sigma and where agile techniques within the same organisation). But more eloquently explained by Simon Wardley: https://youtu.be/Ty6pOVEc3bA
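To give a flavour of the Vision API bullet above, here is a minimal sketch of the JSON body that Google’s public REST endpoint (images:annotate) accepts for face and label detection. It only builds the request; sending it needs an API key or OAuth token, which is left out, and the placeholder bytes stand in for a real JPEG. The helper name is my own.

```python
import base64
import json

# Public REST endpoint for the Cloud Vision API. Credentials (not shown)
# would be needed to actually POST the body built below.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes, max_results=10):
    """Build an images:annotate body asking for faces and content labels."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [
                    # Face positions, bounds and likelihoods such as joy.
                    {"type": "FACE_DETECTION", "maxResults": max_results},
                    # A description of what is in the picture.
                    {"type": "LABEL_DETECTION", "maxResults": max_results},
                ],
            }
        ]
    }


# Placeholder bytes standing in for the contents of a JPEG file (assumption).
fake_jpeg = b"\xff\xd8\xff\xe0 placeholder, not a real image"
body = json.dumps(build_annotate_request(fake_jpeg), indent=2)
print(body)
```

In practice you would read the bytes from an image file and POST the body to the endpoint; the point is how little code sits between an image and a machine-generated description of it.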
There are quite a range of “end of 2016” surveys I’ve seen that reflect quite a few of these trends, albeit from different perspectives (even one that mentioned the end of Java as a legacy language). You can also add overlays with security challenges and trends. But – what have I missed, or what have I got wrong? I’d love to know your views.
A lot of the political effort in the UK appears to circle around a government justifying and handing off parts of our NHS delivery assets to private enterprises, despite the ultimate model (that of the USA healthcare industry) costing significantly more per capita. Outside of politicians lining their own pockets in the future, it would be easy to conclude that few would benefit by such changes; such moves appear to be both economically farcical and firmly against the public interest. I’ve not yet heard any articulation of a view that indicates otherwise. But less well discussed are the changes that are coming, and where the NHS is uniquely positioned to pivot into the future.
There is significant work to capture DNA of individuals, but these are fairly static over time. It is estimated that there are 10^9 data points per individual, but there are many other data points – which change against a long timeline – that could be even more significant in helping to diagnose unwanted conditions in a timely fashion. To flip the industry to work almost exclusively to preventative and away from symptom based healthcare.
I think I was on the right track with an interest in Microbiome testing services. The gotcha is that commercial services like uBiome, and public research like the American (and British) Gut Project, are one-shot affairs. Taking a stool, skin or other location sample takes circa 6,000 hours of CPU wall time to reconstruct the 16S rRNA gene sequences of a statistically valid population profile. Something I thought I could get to a super fast turnaround using excess capacity (spot instances – excess compute power you can bid to consume when available) at one or more of the large cloud vendors. And then to build a data asset that could use machine learning techniques to spot patterns in people who later get afflicted by an undesirable or life threatening medical condition.
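The turnaround idea is simple arithmetic, sketched below. It assumes the circa 6,000 figure means CPU-hours of work for one sample and that the job splits perfectly across cores; both are my assumptions, and real pipelines will scale less cleanly.

```python
# Back-of-envelope turnaround for one sample, assuming perfect parallelism.
CPU_HOURS_PER_SAMPLE = 6000  # circa figure quoted above


def turnaround_hours(cores, cpu_hours=CPU_HOURS_PER_SAMPLE):
    """Elapsed hours if the work divides evenly across the given cores."""
    if cores < 1:
        raise ValueError("need at least one core")
    return cpu_hours / cores


for cores in (1, 250, 1000):
    print(f"{cores} core(s): {turnaround_hours(cores):.0f} hours elapsed")
```

On those assumptions one core takes 250 days, while 1,000 borrowed cores would bring a sample back in about six hours, which is the attraction of bidding for idle capacity.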
The primary weakness in the plan is that you can’t suss the way a train is travelling by examining a photograph taken looking down at a static railway line. You need to keep the source sample data (not just a summary) and measure at regular intervals; an incidence of salmonella can routinely knock out 30% of your Microbiome population inside 3 days before it recovers. The profile also flexes wildly based on what you eat and other physiological factors.
The other weakness is that your DNA and your Microbiome samples are not the full story. There are many other potential leading indicators that could determine your propensity to become ill that we’re not even sampling. The questions of which of our 10^18 different data points are significant over time, and how regularly we should be sampled, are open questions
Experience in the USA is that the one environment where regular preventative checkups of otherwise healthy individuals take place – dentistry – has managed to lower the cost of service delivery by 10% at a time when the rest of the health industry has seen 30-40% cost increases.
So, what are the measures that should be taken, how regularly and how can we keep the source data in a way that allows researchers to employ machine learning techniques to expose the patterns toward future ill-health? There was a good discussion this week on the A16Z Podcast on this very subject with Jeffrey Kaditz of Q Bio. If you have a spare 30 minutes, I thoroughly recommend a listen: https://soundcloud.com/a16z/health-data-feedback-loop-q-bio-kaditz.
That said, my own savings are such that I have to refocus my own efforts elsewhere back in the IT industry, and my MicroBiome testing service Business Plan is mothballed. The technology to sample a big enough population regularly is not yet deployable in a cost effective fashion, but will come. When it does, the NHS will be uniquely positioned to pivot into the sampling and preventative future of healthcare unhindered.
One of my pet hates is seeing my wife visit the doctor, getting hunches of what may be afflicting her health, and this leading to a succession of “oh, that didn’t work – try this instead” visits for several weeks. I just wonder how much cost could be squeezed out of the process – and lack of secondary conditions occurring – if the root causes were much easier to identify reliably. I then wonder if there is a process to achieve that, especially in the context of new sensors coming to market and their connectivity to databases via mobile phone handsets – or indeed WiFi enabled, low end Bluetooth sensor hubs aka the Apple Watch.
I’ve personally kept a record of what I’ve eaten, down to fat, protein and carb content (plus my Monday 7am weight and daily calorie intake) every day since June 2002. That is a precursor to the future where devices can keep track of a wide variety of health signals, feeding a trend (in conjunction with “big data” and “machine learning” analyses) toward self service health. My Apple Watch has a year’s worth of heart rate data. But what signals would be far more compelling to a wider variety of (lack of) health root cause identification if they were available?
There is currently a lot of focus on Genetics, where the Human Genome can betray many characteristics or pre-dispositions to some health conditions that are inherited. My wife Jane got a complete 23andMe statistical assessment several years ago, and has also been tested for the BRCA2 (pronounced ‘bracca-2’) gene – a marker for inherited pre-disposition to risk of Breast Cancer – which she fortunately did not inherit from her afflicted father.
A lot of effort is underway to collect and sequence the complete Genome sequences from the DNA of hundreds of thousands of people, building them into a significant “Open Data” asset for ongoing research. One gotcha is that such data is being collected by numerous organisations around the world, and the size of each individual’s DNA (assuming one byte to each nucleotide component – A/T or C/G combinations) runs to 3GB of base pairs. You can’t do research by throwing an SQL query (let alone thousands of machine learning attempts) over that data when samples are stored in many different organisations’ databases, hence the existence of an API (courtesy of the GA4GH Data Working Group) to permit distributed queries between co-operating research organisations. It is notable that there are Amazon Web Services and Google employees participating in this effort.
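The storage arithmetic behind that 3GB figure can be sketched as follows. The three billion base pair count is simply the round number implied by the text, and the two bits per base alternative and the 100,000 person cohort are my own illustrative assumptions, not figures from any of the projects mentioned.

```python
BASE_PAIRS_PER_GENOME = 3_000_000_000  # round figure implied by the 3GB estimate


def storage_bytes(base_pairs, bits_per_base):
    """Bytes needed to hold base_pairs at bits_per_base, rounded up."""
    return (base_pairs * bits_per_base + 7) // 8


if __name__ == "__main__":
    one_byte_each = storage_bytes(BASE_PAIRS_PER_GENOME, 8)
    two_bits_each = storage_bytes(BASE_PAIRS_PER_GENOME, 2)  # A, C, G, T fit in 2 bits
    cohort = 100_000  # hypothetical number of people sequenced

    print(f"One byte per base: {one_byte_each / 1e9:.2f} GB per person")
    print(f"Two bits per base: {two_bits_each / 1e9:.2f} GB per person")
    print(f"Cohort at one byte per base: {one_byte_each * cohort / 1e12:.0f} TB")
```

Even with tighter packing, a cohort of that size runs to tens or hundreds of terabytes spread across many institutions, which is why a query has to travel to the data rather than the other way round.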
However, I wonder if we’re missing a big and potentially just as important data asset; that of the profile of bacteria that everyone is dependent on. We are each home to approx. 10 trillion human cells among the 100 trillion microbial cells in and on our own bodies; you are 90% not you.
While our human DNA is 99.9% identical to that of any person next to us, the profiles of our MicroBiomes are typically only 10% similar; our age, diet, genetics, physiology and use of antibiotics are also heavy influencing factors. Our DNA is our blueprint; the profile of the bacteria we carry is an ever changing set of weather conditions that either influence our health – or are leading indicators of something being wrong – or both. Far from being inert passengers, these little organisms play essential roles in the most fundamental processes of our lives, including digestion, immune responses and even behaviour.
Different MicroBiome ecosystems are present in different areas of our body, including our skin, mouth, stomach, intestines and genitals; most promise is currently derived from the analysis of stool samples. Further, our gut is second only to our brain in the number of nerve endings present, many of them able to enact activity independently from decisions upstairs. In other areas, there are very active hotlines between the two nerve cities.
Research is emerging that suggests previously unknown links between our microbes and numerous diseases, including obesity, arthritis, autism, depression and a litany of auto-immune conditions. Everyone knows someone who eats like a horse but is skinny thin; the composition of microbes in their gut is a significant factor.
Meanwhile, costs of DNA sequencing and compute power have dropped to a level where the cost of analysing our microbe ecosystems has fallen from $100M a decade ago to some $100 today. It should continue on that downward path to a level where personal regular sampling could become available to all – if access to the needed sequencing equipment plus compute resources were more accessible and had much shorter total turnaround times. Not least to provide a rich Open Data corpus of samples that we can use for research purposes (and to feed back discoveries to the folks providing samples). So, what’s stopping us?
Data Corpus for Research Projects
To date, significant resources are being expended on Human DNA Genetics and comparatively little on MicroBiome ecosystems; the largest research projects are custom built and have sampling populations of fewer than 4,000 individuals. The result is population sizes and sample frequencies too small to conduct wholesale analyses easily and quickly, whether to understand the components of health afflictions, to track changes to the mix over time or to isolate root causes.
There are open data efforts underway with the American Gut Project (based out of the Knight Lab in the University of California, San Diego) plus a feeder “British Gut Project” (involving Tim Spector and staff at King’s College London). The main gotcha is that the service is one-shot and takes several months to turn around. My own sample, submitted in January, may take up to 6 months to work through their sequencing then compute batch process.
In parallel, VC funded company uBiome provide the sampling with a 6-8 week turnaround (at least for the gut samples; slower for the other 4 area samples we’ve submitted), though they are currently not sharing the captured data to the best of my knowledge. That said, the analysis gives an indication of the names, types and quantities of bacteria present (with a league table of those over and under represented compared to all samples they’ve received to date), but they do not currently communicate any health related findings.
My own uBiome measures suggest my gut ecosystem is more diverse than 83% of folks they’ve sampled to date, which is an analogue for being more healthy than most; those bacteria that are over represented – one up to 67x more than is usual – are of the type that orally administered probiotics attempt to get to your gut. So a life of avoiding antibiotics whenever possible appears to have helped me.
However, the gut ecosystem can flex quite dramatically. As an example, see what happened when one person contracted Salmonella over a three day period (the green in the top of this picture; x-axis is days); you can see an aggressive killing spree where 30% of the gut bacteria population are displaced, followed by a gradual fight back to normality:
Under usual circumstances, the US/UK Gut Projects and indeed uBiome take a single measure and report back many weeks later. The only extra feature that may be deduced is the delta between counts of genome start and end sequences, as this will give an indication of the relative species population growth rates from otherwise static data.
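A minimal sketch of that growth hint follows, on the assumption that a dividing population carries more copies of DNA near the start (replication origin) of its genome than near the end (terminus). I have used a ratio rather than a plain delta so that species with very different read totals can be compared; that choice, the function names and the read counts in the example are all hypothetical.

```python
from typing import Dict, List, Tuple


def origin_terminus_ratio(origin_reads: int, terminus_reads: int) -> float:
    """Reads covering the genome start divided by reads covering its end.

    A value near 1.0 suggests a resting population; higher values suggest
    more cells were part way through copying their genome when sampled.
    """
    if terminus_reads <= 0:
        raise ValueError("terminus_reads must be positive")
    return origin_reads / terminus_reads


def rank_by_growth(coverage: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Sort species by ratio, fastest apparent growth first."""
    ratios = {
        species: origin_terminus_ratio(origin, terminus)
        for species, (origin, terminus) in coverage.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up (origin, terminus) read counts for two placeholder species.
    sample = {"species_a": (100, 100), "species_b": (300, 150)}
    for species, ratio in rank_by_growth(sample):
        print(f"{species}: {ratio:.2f}")
```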
I am not aware of anyone offering a faster turnaround service, nor one that can map several successively time gapped samples, let alone one that can convey health afflictions that can be deduced from the mix – or indeed from progressive weather patterns – based on the profile of bacteria populations found.
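To make the idea of mapping successive, time gapped samples concrete, here is a small sketch that compares two profiles from the same person and reports what share of the population has been displaced. The measure used (half the summed absolute change in relative abundance) is one simple choice of mine rather than anything the Gut Projects or uBiome are known to use, and the counts are invented to mirror the Salmonella example above.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw per-species counts into shares that sum to 1.0."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample contains no counted organisms")
    return {species: count / total for species, count in counts.items()}


def fraction_displaced(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Share of the population that changed hands between two samples.

    0.0 means identical profiles; 1.0 means no species in common.
    """
    p = relative_abundance(before)
    q = relative_abundance(after)
    species = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in species)


if __name__ == "__main__":
    # Invented counts: a newcomer takes 30% of the population from species_a.
    day_0 = {"species_a": 50, "species_b": 30, "species_c": 20}
    day_3 = {"species_a": 20, "species_b": 30, "species_c": 20, "salmonella": 30}
    print(f"Displaced between samples: {fraction_displaced(day_0, day_3):.0%}")
```

Run against a series of samples, a jump in this figure between two dates is exactly the kind of weather pattern a single one-shot measure cannot show.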
My questions include:
- Is there demand for a fast turnaround, wholesale profile of a bacterial population to assist medical professionals in isolating indicators – or the root cause – of ill health with impressive accuracy?
- How useful would a large corpus of bacterial “open data” be to research teams, to support their own analysis hunches and indeed to support enough data to make use of machine learning inferences? Could we routinely take samples donated by patients or hospitals to incorporate into this research corpus? Do we need the extensive questionnaires that the various Gut Projects and uBiome issue completed alongside every sample?
- What are the steps in the analysis pipeline that are slowing the end to end process? Does increased sample size (beyond a small stain on a cotton bud) remove the need to enhance/copy the sample, with its associated need for nitrogen-based lab environments (many types of bacteria are happy as Larry in the Nitrogen of the gut, but perish with exposure to oxygen)?
- Is there any work active to make the QIIME (pronounced “Chime”) pattern matching code take advantage of cloud spot instances, including Hadoop or Spark, to speed the turnaround time from Sequencing reads to the resulting species type:volume value pairs?
- What’s the most effective delivery mechanism for providing “Open Data” exposure to researchers, while retaining the privacy (protection from financial or reputational prejudice) for those providing samples?
- How do we feed research discoveries back (in English) to the folks who’ve provided samples and their associated medical professionals?
Next Generation Sequencing works by splitting DNA/RNA strands into relatively short read lengths, which then need to be reassembled against known patterns. Taking a poop sample which contains thousands of different bacteria is akin to throwing the pieces of many thousand puzzles into one pile and then having to reconstruct them back – and count the number of each. As an illustration, a single HiSeq run may generate up to 6 x 10^9 sequences; these then need reassembling and the count of 16S rDNA type:quantity value pairs deduced. I’ve seen estimates of six thousand CPU hours to do the associated analysis to end up with statistically valid type and count pairs. This is a possible use case for otherwise unused spot instance capacity at large cloud vendors if the data volumes could be ingested and processed cost effectively.
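The jigsaw analogy can be shown with a deliberately tiny toy: match each short read against a handful of known reference patterns and tally the type:quantity pairs. This is not how QIIME or any production pipeline works (real tools tolerate sequencing errors and use far smarter matching); the exact substring test, the function name and the sequences are all made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_reads(reads: Iterable[str], references: Dict[str, str]) -> Counter:
    """Count reads per reference type using exact substring matching.

    A read is credited to a type only if it matches exactly one reference;
    reads matching none, or more than one, are counted as "unassigned".
    """
    counts: Counter = Counter()
    for read in reads:
        matches = [name for name, ref in references.items() if read in ref]
        if len(matches) == 1:
            counts[matches[0]] += 1
        else:
            counts["unassigned"] += 1
    return counts


if __name__ == "__main__":
    # Invented reference patterns and reads; real 16S sequences are far longer.
    references = {"type_a": "ACGTACGTTTGA", "type_b": "GGCATTCCAGGA"}
    reads = ["ACGTAC", "TTGA", "CATTCC", "AAAAAA"]
    for name, quantity in sorted(tally_reads(reads, references).items()):
        print(f"{name}: {quantity}")
```

Each read can be matched independently of the others, which is what makes the job a plausible fit for large pools of short-lived spot instances.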
Nanopore sequencing is another route, which has much longer read lengths but is much more error prone (1% for NGS, typically up to 30% for portable Nanopore devices), which probably limits its utility for analysing bacteria samples in our use case. It is much more useful if you’re testing for particular types of RNA or DNA, rather than the wholesale profiling exercise we need. Hence for the time being, we’re reliant on trying to make an industrial scale, lab based batch process turn around data as fast as we are able – but having a network accessible data corpus and research findings feedback process in place if and when sampling technology gets to be low cost and distributed to the point of use.
The elephant in the room is in working out how to fund the build of the service, to map its likely cost profile as technology/process improvements feed through, and to know to what extent its diagnosis of health root causes will improve its commercial attractiveness as a paid service over time. That is what I’m trying to assess while on the bench between work contracts.
Nature has its way of providing short cuts. Dogs have been trained to be amazingly prescient at assessing whether someone has Parkinson’s just by smelling their skin. There are other techniques where a pocket sized spectrometer can assess the existence of 23 specific health disorders. There may well be other techniques that come to market that don’t require a thorough picture of a bacterial population profile to give medical professionals the identity of the root causes of someone’s ill health. That said, a thorough analysis may at least be of utility to the research community, even if we get to only eliminate ever rarer edge cases as we go.
Coming full circle
One thing that’s become eerily apparent to date is some of the common terminology between MicroBiome conditions and terms I once heard used in Chinese Herbal Medicine (my wife’s psoriasis was cured after seeing a practitioner in Newbury for several weeks nearly 20 years ago). There is the concept of “balance” and the existence of “heat” (betraying the inflammation as your bacterial population of different species ebbs and flows in reaction to different conditions). Then there is the consumption or application of specific plant matter that puts the body’s bacterial population back to operating norms.
We’ve started to discover that some of the plants and herbs used in Chinese Medicine do have symbiotic effects on your bacterial population for the conditions they are reckoned to help cure. With that, we are starting to see some statistically valid evidence that Chinese and Western medicine may well meet in the future, and be part of the same process in our future health management.
Until then, still work to do on the business plan.
The core essence of most management books I read can be boiled down to occupy a sheet of A4. There have also been a few big mistakes along the way among what were considered at the time to be seminal works, such as Tom Peters’ “In Search of Excellence” – which in retrospect could be summarised as “even the most successful companies possess DNA that also breeds the seeds of their own destruction”.
I have much simpler business dynamics mapped out that I can explain to fast track employees – and demonstrate – inside an hour; there are usually four graphs that, once drawn, will betray the dynamics (or points of failure) afflicting any business. That was a very useful lesson I learnt from Microsoft when I used to distribute their software. But I digress.
Among my many business books, I thought the insights in Geoffrey Moore’s book “Crossing the Chasm” were brilliant – and useful for helping grow some of the product businesses I’ve run. The only gotcha is that I found myself continually cross referencing different parts of the book when trying to build a go-to-market plan for DEC Alpha AXP Servers (my first use of his work) back in the mid-1990s – the time I worked for one of DEC’s Distributors.
So, suitably bored when my wife was watching J.R. Ewing being mischievous in the first UK run of “Dallas” on TV, I sat on the living room floor and penned this one page summary of the book’s major points. Just click it to download the PDF with my compliments. Or watch the author himself describe the model in under 14 minutes at an O’Reilly Strata Conference here. Or alternatively, go buy the latest edition of his book: Crossing the Chasm.
My PA (when I ran Marketing Services at Demon Internet) redrew my hand-drawn sheet of A4 into the Microsoft Publisher document that output the one page PDF, and that I’ve referred to ever since. If you want a copy of the source file, please let me know – drop a request to: [email protected].
That said, I’ve been far more inspired by the recent work of Simon Wardley. He effectively breaks a service into its individual components and positions each on a 2D map; the x-axis dictates the stage of the component’s evolution as it goes through a Chasm-style lifecycle; the y-axis symbolises the value chain from raw materials to end user experience. You then place all the individual components and their linkages as part of an end-to-end service on the result. Having seen the landscape in this map form, you can then assess how each component evolves/moves from custom build to commodity status over time. Even the newest components evolve from chaotic genesis (where standards are not defined and/or features incomplete) to becoming well understood utilities in time.
The result highlights which service components need Agile, fast iterating discovery and which are becoming industrialised, six-sigma commodities. And once you see your map, you can focus teams and their measures on the important changes needed without breeding any contradictory or conflict-ridden behaviours. You end up with a well understood map and – once you overlay competitive offerings – can also assess the positions of other organisations that you may be competing with.
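For anyone who thinks better in code than in pictures, the structure of such a map can be sketched in a few lines. This is my own minimal model rather than anything Simon has published as software: the class names, the 0.0 to 1.0 value chain scale, the rule for which stages count as “discovery” versus “industrialised” and the example components are all assumptions. The intermediate Product stage comes from his published maps rather than from the description above.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Tuple


class Evolution(IntEnum):
    """x-axis: how far a component has evolved."""
    GENESIS = 0
    CUSTOM_BUILT = 1
    PRODUCT = 2
    COMMODITY = 3


@dataclass
class Component:
    name: str
    evolution: Evolution
    value_chain: float  # y-axis: 1.0 = end user experience, 0.0 = raw materials


@dataclass
class WardleyMap:
    components: Dict[str, Component] = field(default_factory=dict)
    links: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def link(self, consumer: str, provider: str) -> None:
        """Record that one component depends on another lower in the chain."""
        for name in (consumer, provider):
            if name not in self.components:
                raise KeyError(f"unknown component: {name}")
        self.links.append((consumer, provider))

    def needs_discovery(self) -> List[str]:
        """Components still in genesis or custom build (assumed cut-off)."""
        return sorted(c.name for c in self.components.values()
                      if c.evolution <= Evolution.CUSTOM_BUILT)

    def industrialised(self) -> List[str]:
        """Components that have reached commodity status."""
        return sorted(c.name for c in self.components.values()
                      if c.evolution == Evolution.COMMODITY)


if __name__ == "__main__":
    service = WardleyMap()
    service.add(Component("customer report", Evolution.CUSTOM_BUILT, 1.0))
    service.add(Component("compute", Evolution.COMMODITY, 0.1))
    service.link("customer report", "compute")
    print("Agile discovery:", service.needs_discovery())
    print("Six-sigma commodity:", service.industrialised())
```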
The only gotcha in all of this approach is that Simon hasn’t written the book yet. However, I notice he’s just provided a summary of his work on his Bits n Pieces Blog yesterday. See: Wardley Maps – set of useful Posts. That will keep anyone out of mischief for a very long time, but the end result is a well articulated, compelling strategy and the basis for a well thought out, go to market plan.
In the meantime, the basics on what is and isn’t working, and sussing out the important things to focus on, are core skills I can bring to bear for any software, channel-based or internet related business. I’m also technically literate enough to drag the supporting data out of IT systems for you where needed. Whether your business is an Internet-based startup or an established B2C or B2B Enterprise focussed IT business, I’d be delighted to assist.