Laura Haas, IBM Fellow, has been recognized by the Anita Borg Institute and the Grace Hopper Conference for her outstanding contributions to technology. I am so happy to be here to hear her talk today on information integration!
Haas and her team are trying to tackle the problem of getting information to people when they need it. For example, if a doctor is treating a patient with cancer, she will need to find information on how this type of cancer has been treated in the past and how well the treatments have worked, and she will need access to past patient records.
The challenges are diverse data models, overlapping data, and incomplete and often inconsistent data. The different people involved want different views of the data, and both needs and knowledge change over time.
In order to do data integration, you need to understand what is available, as well as what the data means or its intent. You have to set up the schema, figure out how to identify information about the same object and figure out what to do with missing or inconsistent data. You need to decide which problems you're trying to solve, and execute - and hope the customer doesn't come back and tell you that they really wanted something else entirely :-)
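To make those steps a little more concrete, here is a minimal Python sketch of my own (not something shown in the talk). It assumes two made-up sources, a clinic and a registry, with different column names. All the data, field names and helper functions are hypothetical, and matching on normalized name plus birth date is just one simple stand-in for figuring out which records describe the same object.

```python
# Hypothetical rows from two sources that describe the same patient differently.
CLINIC_ROWS = [
    {"patient_name": "Ada Jones", "dob": "1970-03-02", "dx": "melanoma", "treatment": None},
]
REGISTRY_ROWS = [
    {"name": "JONES, ADA", "birth_date": "1970-03-02", "diagnosis": "melanoma", "therapy": "surgery"},
]

# Schema mappings: source column -> column in the shared (integrated) schema.
CLINIC_MAP = {"patient_name": "name", "dob": "birth_date", "dx": "diagnosis", "treatment": "treatment"}
REGISTRY_MAP = {"name": "name", "birth_date": "birth_date", "diagnosis": "diagnosis", "therapy": "treatment"}


def to_common(row, mapping):
    """Rename a source row's columns into the shared schema."""
    return {target: row.get(source) for source, target in mapping.items()}


def normalize_name(name):
    """Turn 'JONES, ADA' and 'Ada Jones' into the same string, 'ada jones'."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.lower().split())


def entity_key(record):
    """Assumed identity rule: same normalized name and same birth date."""
    return (normalize_name(record["name"]), record["birth_date"])


def integrate(*sources):
    """Merge (rows, mapping) pairs; fill gaps and report conflicting values."""
    merged = {}
    conflicts = []
    for rows, mapping in sources:
        for row in rows:
            record = to_common(row, mapping)
            key = entity_key(record)
            target = merged.setdefault(key, {"name": key[0], "birth_date": key[1]})
            for field_name, value in record.items():
                if field_name in ("name", "birth_date") or value is None:
                    continue  # identity fields are already set; missing values are skipped
                if field_name in target and target[field_name] != value:
                    conflicts.append((key, field_name, target[field_name], value))
                else:
                    target[field_name] = value
    return merged, conflicts


merged, conflicts = integrate((CLINIC_ROWS, CLINIC_MAP), (REGISTRY_ROWS, REGISTRY_MAP))
```

In this toy case the clinic's missing treatment is filled in from the registry, and any disagreement between sources is collected in `conflicts` rather than silently overwritten, since deciding what to do with inconsistent data is a judgment call.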
Dr. Haas started her career at IBM in 1981, when relational databases were just coming onto the scene. You no longer had to be a database wizard to write code to interact with a database - which broadened the concept of information integration. They even called it "eager integration" - as you could eagerly get as much data as you wanted.
She then started her work on the R* project (pronounced R-star), which was a distributed relational database management system. A single query could access data in multiple homogeneous relational DBMSs. This type of system helped prevent data loss and helped to distribute queries and transaction management. While the project did not have much commercial success, it paved the way for a lot of work in database systems, for future IBM products and for her own future research.
Relational database technology was growing rapidly in 1984, a very exciting time for those in the industry. Dr. Haas then joined the Starburst team (no, not named for the "fruit" chews, but named as an extension of the R* project). This was an extensible relational DBMS that allowed many types of additions - new functions, optimizations, indexes, data types, and storage methods. The best part of this? This project had legs - and became the foundation for IBM's DB2 "for workstations".
Several people who worked on this project ended up being named Fellows or Distinguished Engineers, though she notes it took her a lot longer to become a Fellow than it took her male colleagues, and she had to earn many more accolades. Dr. Haas recommends that you surround yourself with the best team you can find. Do not be intimidated if they are better or smarter than you are, as they will take you places!
Dr. Haas was able to take a sabbatical from IBM to study at the University of Wisconsin-Madison, where she studied with the brightest minds in database technology at the time (1992).
One of the new problems that needed to be solved in 1993, when she returned to IBM, was how to store the images, videos and text that were starting to proliferate online. Digital libraries started to emerge in this time frame, and they would eventually leverage relational DBMSs. Customers were starting to want databases that could store multiple data types, so Dr. Haas and her team looked back at concepts from R* and Starburst to solve the problem and started a new project... Garlic.
Why Garlic? Because Dr. Haas doesn't like acronyms, which IBM was famous for at the time, and she loves to cook. Garlic and chocolate are her favorite things, and her old team thought that if they renamed the team/project to Garlic, they'd get her to come back from sabbatical. It worked!
Garlic was a data-less (object-)relational DBMS (aka virtual DBMS/federated DBMS). It had all the benefits of a high-level query language and all the features of the underlying data sources. This not only became a product for IBM, but started two separate business units (Life Sciences and InfoSphere Information Integration). Something that is very obvious listening to Dr. Haas speak is that once you find people you like working with - stick with them. You can do amazing things!
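For anyone wondering what a "data-less" federated DBMS looks like in practice, here is a rough Python sketch of the general idea. It is my own illustration and not Garlic's actual design or API. It assumes two hypothetical sources (an in-memory SQLite table and a dictionary standing in for an image store), each behind a small wrapper class, and a query layer that joins them on demand without keeping any data of its own.

```python
import sqlite3


class SqlSource:
    """Hypothetical wrapper around a relational source (in-memory SQLite here)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE compounds (id INTEGER, name TEXT)")
        self.conn.executemany(
            "INSERT INTO compounds VALUES (?, ?)",
            [(1, "aspirin"), (2, "caffeine")],
        )

    def scan(self):
        """Yield every compound as a dict in the federated layer's vocabulary."""
        cursor = self.conn.execute("SELECT id, name FROM compounds ORDER BY id")
        for compound_id, name in cursor:
            yield {"compound_id": compound_id, "name": name}


class ImageSource:
    """Hypothetical wrapper around a non-relational source (a stand-in image store)."""

    def __init__(self):
        self._images = {1: "aspirin.png", 2: "caffeine.png"}

    def lookup(self, compound_id):
        """Return the image file name for a compound, or None if there is none."""
        return self._images.get(compound_id)


def federated_join(sql_source, image_source):
    """Answer one query across both sources; nothing is copied or stored here."""
    for row in sql_source.scan():
        yield {**row, "image": image_source.lookup(row["compound_id"])}


results = list(federated_join(SqlSource(), ImageSource()))
```

The data stays where it lives, and the federated layer only knows how to ask each wrapper for what it needs at query time.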
IBM was having trouble with integration, as people working in life sciences who were trying to work together wouldn't use the same database as their colleagues, so Dr. Haas's team worked on something called InfoLink to attempt to bridge this gap. Unfortunately the project was not a market success, but it did help get IBM in the door at new customers and led to the InfoSphere suite - "a complete line of products for all your integration needs."
The longer Dr. Haas was at IBM, the larger her teams got - from a 10-person research team to a 120-person development organization and eventually to over 700 people (no team picture for that group... :-)
While this all sounds wonderful, there were still major problems that needed to be solved in 1999. As more people were adding federation to their systems, issues emerged. Set-up of federation was too slow and complicated, and while the development team had assumed users would be doing very simple joins/queries, it turned out that complex queries were more the norm.
This led to yet another project, Clio (not an acronym!), to do schema mapping by simply drawing lines! This opened up many more doors for IBM in the DBMS space and gave the researchers many more ideas for future projects.
What impressed me most about Dr. Haas was the importance she gave to her team. She was so proud of each and every person she ever worked with, remembered their names and knew all about what they were doing now. Dr. Haas is clearly an amazing collaborator and it's not surprising that these brilliant people want to work with her.
What a phenomenal technical woman, very deserving of the Anita Borg Technical Leadership award!