On the Ragged Edge of a Revolution

Big Data Tools that Advance Quality and Improve Economics in Papermaking


There is a significant opportunity for improving quality and economic performance in paper manufacturing through application of Big Data analysis. Obstacles are being overcome, but improvements in complementary analytical tools are still needed and are in development. Through these tools, mills can reap significant benefits.

The author and his team, comparing longer term data to a current process output. From center, counterclockwise around keyboard: Peter Hart, Adele Panek, John DeJarnette, and Nichole Kilgore.

Fig. 1: Green denotes the stage is in relatively good shape, yellow means “currently possible but needs some improvement,” red means “needs significant improvement.”

The term Big Data was first documented in a 1997 paper by NASA researchers to describe data sets large enough to tax the capacity of mainframe machine memory, local disk space, and remote disk space. More recently, Big Data has become an all-encompassing term for stored real-time data too large and complex to process with manual data management tools or traditional processing methods.

In the paper industry, Big Data usually means a set of real-time operating, performance, and quality parameters spanning several months or more. Buried in the data are important correlations that, if understood, can provide operating and troubleshooting guidance that can lead to performance improvement.

Figure 1 shows a series of specific steps required for Big Data projects. If performed well, Big Data projects will result in substantial cost savings and quality improvements for the mill.

When real-time data historian systems, like PI and PARCview, were first implemented, the data were typically used in small snapshots to address specific process and cost issues or to answer a customer’s complaint [1]. Even though mills have had large data sets available to them, they have not embraced using Big Data routinely. Why not?

Buried Data: Data historians usually produce an amorphous mountain of ever-changing data. One significant impediment to the adoption of Big Data as a routine process/product improvement tool has been the difficulty of extracting the relevant data and discerning patterns in the data. The paper industry has lacked sufficiently powerful computing tools, protocols and statistical analysis techniques to assist operators and production engineers with Big Data retrieval and analysis.

As data processing tools and methodologies have improved, it has become apparent that, within that information mountain, rich veins of signals are buried. When properly analyzed and understood, these signals can lead to substantial improvements in daily operations, improved product quality, reduced energy demands, and reduced costs. In short, there is gold buried within all the data.

Unfortunately, just like the valuable ore hiding within real mountains, the useful data are often contained within a significant overburden, which must be removed to get to the desired material.

System Design Shortcomings: Vendors have been trying to make it easier to analyze long-term data by incorporating relational databases into the data historian. Data historians do an excellent job storing and compressing data in a time-series format. Unfortunately, relational databases do not handle raw, unfiltered time series data particularly well. Moreover, good data are often interspersed with bad data because of drifting, scaled or failed sensors, production downtime, data transmission issues, or inaccurate or missing manual entries.
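
The kinds of bad data described above can be flagged programmatically before any analysis begins. Below is a minimal sketch in Python with pandas; the sensor values and limits are invented for illustration. It marks out-of-range readings and flat-lined ("frozen") sensors, two common symptoms of failed instrumentation:

```python
import pandas as pd

def flag_bad_points(series, lo, hi, frozen_window=5):
    """Return a boolean mask marking suspect values in a sensor series."""
    out_of_range = (series < lo) | (series > hi)
    # A reading repeated over many consecutive scans often indicates a
    # stalled or failed sensor rather than a genuinely constant process.
    frozen = series.rolling(frozen_window).std(ddof=0).eq(0)
    return out_of_range | frozen

# Invented example: a sensor that spikes out of range, then freezes.
values = pd.Series([3.1, 3.2, 3.0, 9.9, 3.1, 4.0, 4.0, 4.0, 4.0, 4.0])
mask = flag_bad_points(values, lo=2.0, hi=5.0)
clean = values[~mask]
```

The thresholds and the frozen-window length would in practice come from instrument specifications and scan rates, not guesswork.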

Sorting the good data from the bad data can become a Herculean task. One recent mill study [2] investigated root causes of paper indents using several years of data retrieved from a PI system. A small team of people needed multiple PCs over several days to pull the data from the historian, followed by several weeks of scrubbing the data to obtain clean working files. After all this effort, the team was finally able to begin the analysis.

Another example of the difficulty of extracting usable data was found in a batch digester mill investigating changes in production resulting from screen modifications over a three-year period. Even something as simple as the number of pine and hardwood digester blows in a specified period proved difficult to determine. The mill had three different systems recording digester blows, all of them using the same blow-initiation signal from the distributed control system (DCS) as a time-stamp.

One system is the DCS itself, which initiates the signal to start the digester blow; the second is a mill production/accounting system (used by the financial groups to determine mill productivity and costs), which obtains blow information from the DCS through a programmable logic controller (PLC); the third is the mill data historian, whose time-stamp was also obtained from the DCS.

After two days of focusing exclusively on the number of blows, the team determined that none of the three systems agreed on something as simple as the number of digesters blown per day, month, or even each quarter. All three systems were using the same identifier to record the event, yet the data did not agree.
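
This kind of mismatch can be reproduced in miniature: bucket each system's time-stamps by calendar day and compare the counts. The time-stamps below are invented; the point is that two logs of the same events can agree in total yet disagree day by day, for example when one system records a late-night blow after midnight:

```python
from collections import Counter
from datetime import datetime

def daily_counts(timestamps):
    """Count events per calendar day from a list of datetime stamps."""
    return Counter(ts.date() for ts in timestamps)

# Invented blow time-stamps from two systems keyed off the same signal.
dcs  = [datetime(2018, 3, 1, 2, 0), datetime(2018, 3, 1, 23, 59), datetime(2018, 3, 2, 4, 0)]
# The accounting system logged the late-night blow shortly after
# midnight, shifting it into the next day's total.
acct = [datetime(2018, 3, 1, 2, 0), datetime(2018, 3, 2, 0, 1), datetime(2018, 3, 2, 4, 0)]

d1, d2 = daily_counts(dcs), daily_counts(acct)
mismatched_days = sorted(day for day in set(d1) | set(d2) if d1[day] != d2[day])
```

Note that the grand totals agree while both daily counts disagree, which is exactly the sort of discrepancy that surfaces only when the data are bucketed and reconciled.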

Hybrid Systems: As the volume of data contained within data historians has increased over the years, several vendors (e.g., PARCview and OSI PI) have attempted to graft relational databases onto the data historian to create hybrid analysis systems. These hybrid systems are significantly more powerful than the stand-alone systems.

Hybrid systems of this sort work quite well for trending real-time process data over a limited timeframe, providing finite snapshots of process performance and variability, and for short-term process upset analyses. Unfortunately, they are not yet very good at the retrieval and analysis of long-term Big Data sets, largely because of the extensive cleanup those data sets require.

Bolt-on Tools: To address the difficulty of extracting the “valuable ore” data from the “overburden,” several companies have recently developed bolt-on software packages designed to work with the data historian, e.g., Wedge by Savcor Oy. Programs of this type rapidly collect, cleanse, and analyze Big Data sets. These programs provide fast, online process diagnostics that enable massive amounts of process information to be extracted from various sources (including the data historian) for analysis. Time-delayed process models can be developed, and correlations and probable root causes of various process problems can be determined.
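
The idea behind a time-delayed process model can be sketched simply: slide one signal against the other and keep the lag that maximizes the correlation. The signals and the three-sample delay below are synthetic, not taken from any mill system, and a real tool would do far more (resampling, filtering, multivariate models):

```python
import numpy as np

def best_lag(upstream, downstream, max_lag):
    """Return (lag, correlation) for the lag in 0..max_lag samples
    that gives the strongest Pearson correlation, with the downstream
    signal delayed relative to the upstream one."""
    best = (0, -np.inf)
    for lag in range(max_lag + 1):
        u = upstream[: len(upstream) - lag] if lag else upstream
        d = downstream[lag:]
        r = np.corrcoef(u, d)[0, 1]
        if r > best[1]:
            best = (lag, r)
    return best

# Synthetic data: the downstream signal echoes the upstream one
# three samples later, plus measurement noise.
rng = np.random.default_rng(0)
u = rng.normal(size=200)
d = np.roll(u, 3) + rng.normal(scale=0.1, size=200)
lag, r = best_lag(u, d, max_lag=10)
```

Here the search recovers the built-in three-sample transport delay; in a mill, such a lag might correspond to stock travel time between an upstream sensor and a downstream quality measurement.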

These programs allow for exceptionally rapid cleanup of inaccurate data values. Special events, such as outages, can easily be eliminated from the data. For the indent example mentioned earlier, the data cleansing that required several weeks when performed manually with spreadsheets was accomplished in a couple of hours.
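
Eliminating a special event such as an outage amounts to masking out a date window. A minimal sketch, with invented dates and basis-weight values standing in for real historian data:

```python
import numpy as np
import pandas as pd

# Invented daily data; the zero-ish readings fall during an outage.
data = pd.DataFrame(
    {"basis_weight": [42.0, 41.8, 0.0, 0.1, 42.1]},
    index=pd.to_datetime(
        ["2017-05-01", "2017-05-02", "2017-05-03", "2017-05-04", "2017-05-05"]
    ),
)
# Known special-event windows (start, end), inclusive.
outages = [("2017-05-03", "2017-05-04")]

mask = np.zeros(len(data), dtype=bool)
for start, end in outages:
    mask |= (data.index >= start) & (data.index <= end)
clean = data[~mask]
```

The same masking approach extends to grade changes, trials, or any other period that should be analyzed separately rather than discarded.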

These programs have also been used for rapid analysis of long-term trial data. A specific example involved the impact of three different types of refiner plates on paper quality. A process engineer was able to obtain plate run-life data spanning two years in less than 15 minutes, and use these data to categorize paper quality data by paper grade and plate pattern. As may be seen in Figure 2, there are substantial differences in plate life among the plates tested; more significantly, statistically significant basis weight differences were also detected. This is actionable information guiding operations toward a quality and cost improvement opportunity.

The advent of these new data acquisition and analysis tools has not eliminated the need for process engineers with a deep understanding of the process. Tools for modeling and correlating operating parameters across the entire mill are still developing; of course, correlation may not correspond to causation, and the models may be inaccurate. Experienced engineers familiar with the processes being evaluated must still interpret and act on the signals discerned.

For example, Figure 2 shows significantly different run-hours among the refiner plates, but an engineer still had to determine whether there were any special events that impacted the data, evaluate the impact of the different plates on fiber and resulting paper properties, and estimate the overall economic impact of the different plates. Once these questions were answered, the engineer could then determine the total cost of ownership for each style of plate, identifying a cost optimization opportunity for the mill. In the example depicted in Figure 2, these questions were rapidly answered.

These new tools provide enough computing power to allow mill engineers to see data trends across the entire mill and over several years. When dealing with data sets of this size, however, old methods of analysis and statistical manipulation must be replaced with new and improved capture and analysis techniques.

For instance, in the paper indents example [2], a significant amount of effort was spent determining times of poor operational performance and comparing those “poor” times with periods of “good” operational performance.

With these new tools, it is important not to compare data only from “bad” times, however defined, against data only from “good” times. Data from intermediate operating conditions contain valuable signals as well. The objective of data sorting, therefore, is to define realistic parameter ranges and eliminate only truly bogus values.

A cautionary note: it is no longer appropriate to assume normal distributions of data, as has long been common practice; with so much data, we know the distribution. Because the process is continually adjusted through both manual and automated control actions, the data distribution is specifically not “normal,” and different statistical approaches are required to analyze these non-normal data.
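
One such approach is a rank-based (nonparametric) comparison in the spirit of the Mann-Whitney U test, which makes no normality assumption. The two samples below are invented, and the U statistic is hand-rolled only to keep the sketch self-contained; in practice a library routine such as scipy.stats.mannwhitneyu would be used:

```python
import numpy as np

def mann_whitney_u(a, b):
    """Return the Mann-Whitney U statistic for sample a versus sample b."""
    combined = np.concatenate([a, b])
    order = combined.argsort(kind="mergesort")
    ranks = np.empty(len(combined))
    ranks[order] = np.arange(1, len(combined) + 1)
    # Tied values receive the average of the ranks they span.
    for v in np.unique(combined):
        tied = combined == v
        ranks[tied] = ranks[tied].mean()
    return ranks[: len(a)].sum() - len(a) * (len(a) + 1) / 2

# Invented "good" vs "bad" period samples: the good sample is clipped
# near target with one spike, i.e., distinctly non-normal.
good = np.array([3.0, 3.1, 3.2, 3.1, 6.0])
bad  = np.array([4.0, 4.2, 4.5, 4.1, 4.3])
u = mann_whitney_u(good, bad)
```

Because it compares ranks rather than means, the statistic is unaffected by the spike in the good-period sample, which would badly distort a t-test.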

The use of Big Data is on the cusp of becoming a powerful tool in the identification and elimination of long-term, systematic operational issues within the pulp and paper industry. Big Data techniques allow engineers to determine the true interactions among different areas of the mill, to highlight behaviors in one part of the mill that have negative or positive impacts upon other parts of the mill, and to act as indicated. To fully realize the potential of Big Data, the industry must develop and embrace significant paradigm shifts in data capture, sorting, analysis protocols, and statistical tool packages.

Improvements in computing power, plus software programs that allow data manipulation and cleansing, have recently supported performance of several limited Big Data studies—usually with positive impacts to the mill’s bottom line, and always resulting in new insights. Use of non-normal statistical techniques and improved engineering understanding of Big Data analysis protocols and methods should lead to significant increases in performance. We are truly at the ragged edge of the Big Data revolution.

The author gives special thanks to Kathleen Bennett and Humphrey Moynihan for editorial insights and improvements to this article.

1. M. Sandin, “Operational Data Helps Mills Meet Challenges,” Paper360°, Nov/Dec 2017, pp. 34-36.
2. J. Fu and P.W. Hart, “Leveraging Mill-Wide Big Data Sets for Process and Quality Improvement in Paperboard Production,” TAPPI Journal, Vol. 15, No. 5, 2016, pp. 309-319.

Peter W. Hart is director, fiber science and technology for WestRock in Richmond, VA. He and his co-authors were awarded the 2016 TAPPI Journal Best Paper Award for their research paper “Leveraging Mill-Wide Big Data Sets for Process and Quality Improvement in Paperboard Production,” cited in this article. Reach Hart at [email protected].