How much has IBM’s Watson improved? Abstracts at 2015 ASCO

Every year around the end of May, the American Society of Clinical Oncology (ASCO) holds its annual meeting. It is the most representative of the oncologic societies, and eminent scholars in the field from all across the world take part. Although my specialty is not oncology, I have been waiting for the meeting, at which abstracts on IBM’s Watson have been presented regularly since 2013. I look forward to it so much because, as with other products in digital healthcare, it is difficult to find out exactly what stage of development the product has reached apart from the rosy reports released by the company. ASCO presents an opportunity to obtain relatively objective information on Watson’s level of development.

This year, without fail, four abstracts on research that applied IBM’s Watson to oncology were presented. Through these, I attempted to find clues about the current stage of development of IBM’s Watson.

First, I introduce abstracts related to Watson that were presented in the previous ASCO meetings.

At the 2013 ASCO meeting, Memorial Sloan Kettering Cancer Center (MSKCC), which has been IBM’s partner since early on, presented an abstract titled ‘Beyond Jeopardy!: Harnessing IBM’s Watson to improve oncology decision making.’

Focusing on lung cancer, the authors evaluated natural language processing (NLP) and machine learning (ML). Watson was instructed with 525 actual lung cancer patient cases and 420 virtual patient cases.

The result is summarized in the table below.



The abstract claims that NLP, which is the ability to extract key pieces of information from patient cases, and ML, which is the ability to recommend adequate treatment plans, were both enhanced after repeated tests.

On examining the table above, it is hard to tell whether NLP is actually improving based solely on the results from Batches 1~7. However, as one can discern from Batches 8~16, when the same 300 cases are tested repeatedly, the ability to recommend the correct treatment plan increases from 40% to 77%.

One may wonder whether it is appropriate to determine the ‘correct treatment plan’ based on the judgment of experts at MSKCC. If this is a result that numerous specialists from an eminent cancer center such as MSKCC have agreed upon, I think it is safe to do so. However, it may be hard to trust the recommendations if they are only 77% accurate. Still, given that the accuracy increased rapidly with repeated instruction, it is feasible that the reliability will increase even further with sufficient training.


Even more abstracts were presented in the 2014 ASCO meeting.

First, MSKCC presented research that expanded on its 2013 work. In a study titled ‘Next steps for IBM Watson Oncology: Scalability to additional malignancies,’ it developed similar models for colon cancer, rectal cancer, bladder cancer, pancreatic cancer, renal cancer, ovarian cancer, cervical cancer, and endometrial cancer. Watson was repeatedly instructed with these models, and the proportion of correct treatment plans that it recommended was investigated.

It was shown that with repeated instruction, Watson increased in accuracy with respect to all types of cancers. However, the caveat was that the same patient cases were repeatedly tested, just as in the 2013 study. In order to have clinical significance, Watson needs to show outstanding diagnosing ability on new patient cases after it has been instructed with the patient cases. However, it has not yet reached this stage.
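To see why testing on the same cases proves so little, consider a deliberately extreme toy sketch. Everything in it is hypothetical: the `Memorizer` class, the case features and the treatment labels are invented for illustration and have nothing to do with how Watson actually works or with real clinical guidance. A system that merely memorizes the cases it was shown scores perfectly on those cases and can still fail on every new one.

```python
# Hypothetical toy example: a "model" that only memorizes the cases it was shown.
class Memorizer:
    def __init__(self, default="chemotherapy"):
        self.seen = {}          # maps case features -> plan chosen by the experts
        self.default = default  # fallback answer for cases never seen before

    def train(self, cases):
        for features, plan in cases:
            self.seen[features] = plan

    def recommend(self, features):
        return self.seen.get(features, self.default)


def accuracy(model, cases):
    """Fraction of cases where the model's plan matches the expert's plan."""
    correct = sum(model.recommend(features) == plan for features, plan in cases)
    return correct / len(cases)


# Invented cases (not clinical advice): (stage, marker) -> expert-chosen plan.
training_cases = [
    (("IV", "EGFR+"), "targeted therapy"),
    (("IV", "EGFR-"), "chemotherapy"),
    (("III", "EGFR-"), "chemoradiation"),
    (("I", "EGFR-"), "surgery"),
]
new_cases = [
    (("IV", "ALK+"), "targeted therapy"),
    (("II", "EGFR-"), "surgery"),
    (("IV", "PD-L1 high"), "immunotherapy"),
    (("III", "EGFR+"), "chemoradiation"),
]

model = Memorizer()
model.train(training_cases)

seen_accuracy = accuracy(model, training_cases)
unseen_accuracy = accuracy(model, new_cases)
print(f"accuracy on the cases used for instruction: {seen_accuracy:.0%}")
print(f"accuracy on new cases: {unseen_accuracy:.0%}")
```

A real system would generalize far better than this caricature, but the point stands: only the second number says anything about clinical usefulness, and that is the number the abstracts so far have not reported.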

MSKCC presented another study, titled ‘Piloting IBM Watson Oncology within Memorial Sloan Kettering’s regional network.’ Oncologists in hospitals affiliated with the MSKCC network evaluated Watson’s ability to recommend correct treatment plans for breast, colon and rectal cancer and gave feedback on their experiences using Watson. Only 6 people were included, so the conclusions that can be drawn are limited. Nevertheless, the users found Watson helpful in their decision-making process. However, they complained that too much time was spent entering patient data, as they were required to input over 20 unnecessary components. They pointed out that Watson should be able to collect data directly from preexisting materials, and that it will have to provide more evidence for its decision to rank treatment plans in a certain order. Thus we can say that although Watson is said to have natural language processing power, it still relies on data entered by the doctor.

Meanwhile, MD Anderson, another world-class cancer center, presented research on leukemia. According to an abstract titled ‘MD Anderson’s Oncology Expert Advisor powered by IBM Watson: A Web-based cognitive clinical decision support tool,’ Watson was instructed with 400 leukemia patient cases and with treatment plans based on the decisions of oncologists at MD Anderson. Overall, Watson’s accuracy reached 82.6%. The abstract did not indicate whether Watson was evaluated on the same patient cases it had been instructed with or on completely novel cases given after instruction. To me it seems that it was tested in the same way as in the MSKCC studies.

Based on the research presented in 2013 and 2014, we may conclude that Watson’s natural language processing ability does not yet meet expectations, and that although it has an excellent learning ability, its ability to apply its knowledge to novel patient cases has not been sufficiently verified.


Now, we will examine abstracts that were presented this year.

MSKCC, a long-time partner of IBM, once again presented an abstract this year. The abstract, titled ‘Assessing the performance of Watson for oncology, a decision support system, using actual contemporary clinical cases,’ evaluated Watson’s ability to apply its knowledge to new patient cases, which is exactly what we have been waiting for. The study used 20 thoracic cancer cases of patients who were undergoing treatment for the first time. (The abstract states that thoracic medical oncologists chose the cases; since it was filed under the category Lung cancer-Non-Small Cell Metastatic, they are probably non-small cell lung cancer cases.) Each case came with sufficient diagnostic materials, including molecular pathology lab results, and was entered into Watson as structured attributes. Watson and MSKCC medical personnel were each asked to classify the possible treatment options for each case as ‘Recommended’, ‘For Consideration’ or ‘Not Recommended’. The two sets of classifications were then compared to see how well they corresponded.

50% of the options that were ‘Recommended’ by MSKCC medical personnel were also ‘Recommended’ by Watson. Of the options ‘Recommended’ by MSKCC, 25% were ‘For Consideration’ and 25% were ‘Not Recommended’ by Watson. Sixteen of the cases were metastatic lung cancer, and of the chemotherapy agents actually used by the MSKCC medical personnel in those cases, 88% were classified as either ‘Recommended’ or ‘For Consideration’ by Watson. The cases in which a treatment option was ‘Recommended’ by MSKCC medical staff but ‘Not Recommended’ by Watson involved elderly patients with comorbidities, a type of case that Watson had not yet learned.

The authors thus concluded that Watson’s choice of options came within the boundaries of evidence-based medicine. They also claimed that Watson will be able to increase its accuracy through repeated training with medical personnel and further development. (Still, given the values above, it is questionable whether Watson truly comes ‘within the boundaries of evidence-based medicine.’) However, they pointed out that for elderly patients with comorbidities, whose treatment options are heterogeneous, Watson still faces obstacles that it needs to overcome.

MSKCC presented yet another abstract, which dealt with metastatic breast cancer. The research was titled ‘Steps in developing Watson for Oncology, a decision support system to assist physicians choosing first-line metastatic breast cancer (MBC) therapies: Improved performance with machine learning.’ This abstract is rather difficult to understand, so I will attempt to restructure the information in it. For those who are unsure of my explanation, I recommend reading the original document.

Even when they have many characteristics in common (such as age, activity level, expression of receptors, and baseline treatment), metastatic breast cancer (MBC) patients tend to undergo very different types of treatment. The difference may lie in the choice of chemotherapy agent or in the selection of hormonal therapy agents. Doctors assign weightings to individual factors related to breast cancer (such as the location, extent and size of the tumor, and the severity of symptoms) and combine these to choose the best treatment plan. Research has been conducted on how best to assign these weightings in order to help predict a patient’s prognosis and arrive at treatment decisions.

By showing Watson how MSKCC specialists made decisions specific to each case of breast cancer, this research aimed to teach Watson how specialists assign weightings and thereby improve its ability to recommend treatment options. 101 manufactured MBC cases were used for this process of instruction.

As a result, when all 101 cases were evaluated, accuracy improved from a pre-instructional value of 73.6% to a post-instructional value of 82.1%, a relative increase of 11.5%. When the cases were analyzed by HR and HER2 status, increases of 28.8% for HR+HER2+, 9.6% for HR+HER2-, and 2.8% for HR-HER2+ were reported, while a decrease of 1.4% was reported for HR-HER2-. (For those of you who are not familiar with breast cancer, HR stands for hormone receptor and HER2 for human epidermal growth factor receptor 2; the status of both is crucial to breast cancer physiology and treatment.)
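The 11.5% figure for the overall result is easiest to read as a relative increase rather than a gain in percentage points. A quick check, using only the two accuracy values quoted from the abstract (the variable names are mine):

```python
# Accuracy before and after instruction, as quoted from the abstract (in %).
before = 73.6
after = 82.1

# Absolute gain in percentage points versus relative increase over the baseline.
absolute_gain = after - before
relative_gain = absolute_gain / before * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative increase: {relative_gain:.1f}%")
```

So the overall improvement is 8.5 percentage points. I have not tried to work out whether the subgroup figures are relative or absolute, since the baseline accuracies for each subgroup are not given here.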

Ultimately, Watson’s machine learning model was able to make decisions that more closely resembled those of MSKCC specialists when instructed with manufactured cases and decision-making logic than when instructed with just algorithms (originating from preexisting guidelines).

The results of this study are very surprising. Many people, including myself, thought that Watson was only capable of handling clinical situations based on preexisting research results and guidelines, and that it would not be able to recommend treatment plans for complex cases not yet sufficiently addressed in published papers or textbooks. It was therefore hard to imagine that Watson could also acquire the tacit knowledge that the best experts learn through experience. However, it now seems likely that with appropriate instruction, Watson will be able to reproduce exactly that kind of knowledge. If Watson’s ability to assign weightings in clinical decision-making keeps improving, it may ultimately reach a stage where it can generate medical evidence and guidelines on its own.

MSKCC presented three abstracts in all, the last of which is on early breast cancer. Its title is ‘Integration of multi-modality treatment planning for early stage breast cancer (BC) into Watson for Oncology, a Decision Support System: Seeing the forest and the trees.’

While metastatic breast cancer is only treatable with chemotherapy and hormone therapy, early breast cancer requires surgery and may additionally necessitate axillary lymph node dissection or radiotherapy. In the cases where the cancer is of genetic origin, one may need to seek a consult for genetic counseling. Depending on treatment options, one may also need to seek a consult for fertility preservation, which could be potentially harmed or altered by chemotherapy agents. Given that multidisciplinary team intervention, in which doctors of diverse specialties partake in treatment, is becoming even more widespread, one will need to determine whether Watson is capable of playing the role of the primary oncologist.

MSKCC’s breast cancer specialists evaluated how well Watson can seek consults for lymph node dissection (BS), radiotherapy (RT), clinical genetic counseling (CG) and fertility preservation (FP) after instruction.

When compared to expert opinion, Watson’s ability to seek consults for RT matched 98% of the time; for CG, 94% of the time; and for FP, 91% of the time. Furthermore, in terms of BS, it recommended surgery for all 8 of the cases where an expert determined that surgery was necessary. Of the 12 cases where surgery was not recommended, Watson recommended surgery in 7 cases. The authors concluded that Watson’s performance was quite excellent.
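The surgery figures deserve a closer look. The sketch below turns them into agreement measures, under two assumptions of mine: that the expert opinion is treated as the reference standard, and that the "12 cases where surgery was not recommended" means not recommended by the expert. The counts are the ones quoted above; the variable names are my own.

```python
# Counts quoted from the abstract for BS consults.
expert_yes = 8             # expert judged surgery necessary
watson_yes_given_yes = 8   # Watson also recommended surgery in these cases
expert_no = 12             # expert did not recommend surgery
watson_yes_given_no = 7    # Watson recommended surgery anyway

# 2x2 table, taking the expert opinion as the reference (an assumption).
true_pos = watson_yes_given_yes
false_neg = expert_yes - watson_yes_given_yes
false_pos = watson_yes_given_no
true_neg = expert_no - watson_yes_given_no

sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)
overall_agreement = (true_pos + true_neg) / (expert_yes + expert_no)

print(f"sensitivity: {sensitivity:.0%}")
print(f"specificity: {specificity:.0%}")
print(f"overall agreement: {overall_agreement:.0%}")
```

On these 20 cases Watson never missed a surgery that the expert wanted, but it agreed with the expert in only 5 of the 12 cases where surgery was not recommended, for an overall agreement of 65%. That is well below the figures for RT, CG and FP, so "quite excellent" seems generous for this particular task.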

The last abstract was presented not by MSKCC but by the BC Cancer Agency (BCCA) Genomic Sciences Centre of Vancouver, Canada. It is titled ‘Implementation of Watson Genomic Analytics processing to improve the efficiency of interpreting whole genome sequencing data on patients with advanced cancers.’ It evaluated Watson’s ability to provide information and therapeutic options based on the results of genetic analyses in order to assist with treatment. While it took a human being over 10 days to complete this task, Watson finished its analysis within a matter of minutes (the abstract includes many anecdotes on genetic analyses that I have not discussed here because I am neither very knowledgeable about nor interested in them). In stating that Watson is capable of analyzing big data quickly, this abstract presents nothing new, so I will not discuss it further.

I would like to share my thoughts on the MSKCC abstracts presented at the 2015 ASCO meeting. Apart from the lung cancer abstract, it is unclear whether the same cases were tested repeatedly, as in the abstracts presented at previous meetings, or whether novel cases were applied after repeated instruction. Since no additional comments were made, the former appears to be the case. If so, it may be premature to apply Watson to the clinical setting.

According to an announcement by IBM in October 2014, Bumrungrad International Hospital in Thailand signed a 5-year contract to use Watson in oncology. If Watson is still at the stage where it is working on improving accuracy through repetitive instruction, it does not make sense to apply it to the clinical setting at Bumrungrad. This made me more curious about Watson’s current level of ability and its purpose at Bumrungrad. (Because I was curious, I even studied Bumrungrad’s annual report for 2014, but it only mentions the implementation of Watson, and nothing more specific. For your information, Bumrungrad Hospital is a publicly listed company.)


I was most surprised by the content of the second abstract. As I mentioned previously, it implied that Watson is not only capable of organizing existing information that is scattered, but can also combine this to synthesize new medical knowledge.

Overall, judging only from the content of the presented abstracts, it seems that we are still far from implementing Watson in the clinical setting of cancer treatment. Underestimating the power of technology often leaves us sorry, but there currently seem to be limits to using Watson, even simply to assist doctors. Still, I have little doubt that Watson will continue to improve and will eventually end up playing the role of the doctor in a majority of cases.


