Corn and soybean breeders must analyze vast amounts of data to bring new hybrids and varieties to the market. In 2001, Pioneer generated about one million data points for the year. In 2010, the company collected one million data points a day.
Steve Thompson remembers the day, early in his career, when computers became an integral part of the plant-breeding program. “I was working with a small seed company, and we purchased one IBM XT computer. It cost $5,000 and was supposed to be enough to manage our corn-, soybean- and wheat-breeding program as well as our nurseries,” he says.
Today Thompson, who is global seeds and traits research and development leader for Dow AgroSciences, is planning a new data center at the company’s headquarters in Indianapolis that will handle all the data coming from the breeding program and the next-generation gene-sequencing program. “Instead of one computer, we now talk about clusters of supercomputers to analyze our data,” Thompson says.
One million data points
It is mind-boggling to consider the vast amounts of data that today’s plant breeders generate and analyze. The many tools available, such as molecular markers and gene sequencing, generate a tremendous amount of data on their own. Combine this with all the on-farm data from around the world, and it’s easy to see how the data mountain grows.
“In 2001, we generated about one million data points for the year, which was at the time a tremendous achievement,” says Lane Arthur, vice president and chief information officer for Pioneer. Fast-forward to 2010, and Pioneer was collecting and analyzing one million data points per day.
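To put the two figures side by side, the short sketch below converts the 2010 daily rate into an annual total and compares it with the 2001 figure. It assumes data were collected every day of a 365-day year, which the article does not state; the function and variable names are illustrative only.

```python
# Figures quoted by Pioneer in the article.
POINTS_PER_YEAR_2001 = 1_000_000
POINTS_PER_DAY_2010 = 1_000_000

# Assumption (not from the article): collection runs every day of the year.
DAYS_PER_YEAR = 365


def annual_points(points_per_day: int, days: int = DAYS_PER_YEAR) -> int:
    """Convert a daily collection rate into an annual total."""
    return points_per_day * days


def growth_factor(old_annual: int, new_annual: int) -> float:
    """Return how many times larger the new annual total is than the old one."""
    return new_annual / old_annual


annual_2010 = annual_points(POINTS_PER_DAY_2010)
factor = growth_factor(POINTS_PER_YEAR_2001, annual_2010)

print(f"2010 annual total: {annual_2010:,} data points")
print(f"Growth over 2001: {factor:.0f}x")
```

Under that assumption, the 2010 rate works out to 365 million data points a year, or 365 times the 2001 total.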
“There are two drivers on the research side that result in this data avalanche,” Arthur says. “One is the very nature of crop breeding. You have to grow crops in your local area and collect data in many locations.”
The second driver is the tools now available that generate tremendous amounts of data. “The biggest key to this is our ability to use molecular markers to understand our germplasm,” Arthur says. “And that data is being collected from our 80 years of corn germplasm.”
Much as a producer collects and downloads harvest data from a combine monitor, researchers collect research notes, yield and environmental information, and any other pertinent data, and send it all to a main computer hub, where the data are analyzed again and made available to researchers throughout the company. “We used to wait until the evening and send our field data into the main office at night,” Thompson says. “Now, in most cases, that information is transferred as soon as it is collected in the field.”
Data have always been key tools for plant breeders. How a plant reacts to environmental stress and agronomic practices, and how it yields, help determine whether a new hybrid or variety advances to commercialization.
“We have always been breeding for yield, strong roots, stalks and drought tolerance for decades,” says Dusty Post, global corn technology lead for Monsanto. “The difference here is that we are making progress much faster because of the tools we have available. I compare it to the cell phone. The first cell phones were big and bulky, and that’s the way breeders were working. With markers and high-definition breeding, it’s the smartphone equivalent. Breeders can use the data they are generating to do things faster with more depth and breadth than ever before.”
What breeders are doing now is using the same data, only much more of it, to better predict which lines should be advanced in the breeding process. And with the help of computers and number-crunching software, they can analyze many more lines with better results.
“The advancement of computer technology goes hand in hand with the advancement in plant breeding,” Arthur says. “Without these computers, we would not be able to analyze all the data. Plant breeders would be buried under the mountains of data and not be able to process it fast enough.”
The amount of data is expected to grow even larger in the coming years. “We have a team that continually analyzes just how much data may be used in the future to predict computer needs,” Thompson says. “We know that our system is big enough today, but we need to ensure we are ready for the future.”
Pioneer’s data center, once wall-to-wall with huge, bulky mainframe computers, now houses banks of advanced servers that run three to five operating systems at a time and crunch data at amazing speed. “Computers are definitely getting smaller,” Arthur says. “But our data footprint is getting much larger, and we continually work to ensure we have the means to handle the data. It’s a very big challenge.”
In fact, today’s data pile could be dwarfed by what’s around the corner. “It’s hard to say that today’s data is the tip of the iceberg because it is so big now,” Arthur says. “But there are even greater plant-breeding technologies coming. We now have the ability to sequence every corn plant at the molecular level, and as that cost comes down, we will have more of that data.”
How much data? The corn genome contains about 2 x 10^9 base pairs. That’s 2 billion base pairs of information in every plant. Combine that with field data analysis of soils, moisture, fertility, and sunlight for each hybrid or variety, and the data mountain becomes Mount Everest.
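For a rough sense of scale, the sketch below estimates the raw storage needed for sequence data alone. It uses the article’s figure of 2 billion base pairs and assumes each base (A, C, G or T) is packed into 2 bits, with no quality scores, metadata or compression. The plant count is purely hypothetical and does not come from any of the companies quoted.

```python
# Figure from the article: about 2 x 10^9 base pairs per corn genome.
GENOME_BASE_PAIRS = 2 * 10**9

# Assumption: each base (A, C, G, T) is stored in 2 bits, with no
# quality scores, metadata or compression. Real sequencing files are larger.
BITS_PER_BASE = 2

# Hypothetical number of sequenced plants, for illustration only.
HYPOTHETICAL_PLANTS = 10_000


def raw_genome_bytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store one genome at the given packing density."""
    return base_pairs * bits_per_base // 8


def total_bytes(num_plants: int, base_pairs: int = GENOME_BASE_PAIRS) -> int:
    """Bytes needed to store one packed genome per plant."""
    return num_plants * raw_genome_bytes(base_pairs)


per_plant = raw_genome_bytes(GENOME_BASE_PAIRS)
all_plants = total_bytes(HYPOTHETICAL_PLANTS)

print(f"Per plant: {per_plant / 10**6:,.0f} MB")
print(f"{HYPOTHETICAL_PLANTS:,} plants: {all_plants / 10**12:,.0f} TB")
```

Under these assumptions, one plant’s sequence takes about 500 MB and 10,000 plants take about 5 TB, before any field data on soils, moisture, fertility or sunlight are added.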
“Advances in computer technology add benefit to our plant breeding,” Arthur says. “Our need for computers to analyze this data will only grow.”