CHALLENGE.

Our customer operates an online construction bidding platform that helps contractors, manufacturers and distributors succeed with the largest, most accurate database of construction projects in the industry, backed by data analytics. The customer gathers data from diverse and disparate sources such as websites, content aggregators, phone calls, emails, web forms, and web services, both internal (receipt) and external (acquisition).

Every data source is different, and the level of structure varies from highly structured, fielded data down to completely unstructured text-based content. Data acquisition also involves different document types, including PDFs with and without text layers, images, DWG, BIM, DOC, XLS, CSV, HTML, XML and plain text.
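
To illustrate how such a mix of document types might be sorted before extraction, here is a minimal sketch in Python. The routing table, the category names and the route function are hypothetical and are not taken from the customer's system. BIM is left out because the source does not name a specific file extension for it, and the text-layer check is assumed to be done elsewhere and passed in as a flag.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> extraction path.
ROUTES = {
    ".pdf": "pdf",
    ".png": "ocr", ".jpg": "ocr", ".jpeg": "ocr", ".tif": "ocr", ".tiff": "ocr",
    ".dwg": "cad",
    ".doc": "office", ".xls": "office",
    ".csv": "tabular",
    ".html": "markup", ".xml": "markup",
    ".txt": "text",
}

def route(filename, has_text_layer=False):
    """Return the name of the extraction path for a file.

    has_text_layer only matters for PDFs: a PDF with a text layer can go
    to direct text extraction, one without it has to go through OCR.
    """
    kind = ROUTES.get(Path(filename).suffix.lower(), "unknown")
    if kind == "pdf":
        return "pdf_text" if has_text_layer else "ocr"
    return kind
```

A scanned PDF and a photographed drawing end up on the same OCR path, while fielded formats such as CSV skip OCR entirely.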

SOLUTION.

Developers from Data Cuve analyzed the exact requirements and developed an automated data extraction solution. It integrates PDF text extraction engines and OCR (Optical Character Recognition) SDKs to extract text, table data, raster images and vector images and convert them into XML.
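
As a rough illustration of the XML output step, the sketch below serializes already-extracted text blocks and table rows for one page using Python's standard xml.etree.ElementTree module. The element names (page, text, table, row, cell) and the page_to_xml function are assumptions made for this example; the source does not describe the actual schema, and the extraction engines themselves are not shown.

```python
import xml.etree.ElementTree as ET

def page_to_xml(page_number, text_blocks, tables):
    """Serialize extracted content for one page into an XML string.

    text_blocks: list of strings.
    tables: list of tables, each table a list of rows, each row a list of cells.
    The element names are hypothetical, not the real schema.
    """
    page = ET.Element("page", number=str(page_number))
    for block in text_blocks:
        ET.SubElement(page, "text").text = block
    for rows in tables:
        table = ET.SubElement(page, "table")
        for row in rows:
            row_element = ET.SubElement(table, "row")
            for cell in row:
                ET.SubElement(row_element, "cell").text = str(cell)
    # ElementTree escapes characters such as & and < when serializing.
    return ET.tostring(page, encoding="unicode")
```

Building the tree through the library rather than by string concatenation keeps the output well-formed even when the extracted text contains reserved characters.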

The extracted data was indexed and stored as searchable PDFs. When an internal user searches the documents through the web interface developed for the system, the NLP rule set created for the project helps the system return more accurate search results.
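
The idea of indexing extracted text for search can be sketched with a plain inverted index. This is a simplified stand-in, not the NLP rule set mentioned above, whose details the source does not give. The function names and the document identifiers are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it.

    documents: dict of document id -> extracted text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return matches
```

For example, with an index built from {"a.pdf": "Concrete foundation bid", "b.pdf": "Steel framing bid"}, the query "concrete bid" matches only a.pdf, while "bid" matches both documents.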

The automated solution also updates the bookmarks and table of contents of each PDF by analyzing its content. It is equipped with batch processing capabilities so that a large volume of documents can be fed into the system for processing.
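
Batch processing of this kind might look like the following sketch, which runs a per-document function over many files with a thread pool from Python's standard concurrent.futures module and keeps failures separate so that one bad file does not stop the batch. The process_batch function, the process_one callback and the worker count are assumptions for illustration; the source does not say how batching is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(paths, process_one, max_workers=4):
    """Run process_one on every path and collect results and failures.

    process_one is a hypothetical per-document function, for example one
    that extracts a file and returns its XML.
    Returns (results, failures), both keyed by path.
    """
    def safe(path):
        # Catch errors per document so the rest of the batch continues.
        try:
            return path, process_one(path), None
        except Exception as exc:
            return path, None, exc

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, value, error in pool.map(safe, paths):
            if error is None:
                results[path] = value
            else:
                failures[path] = error
    return results, failures
```

Returning the failures alongside the results lets an operator retry or inspect only the documents that could not be processed.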

RESULT.

  • Gather data from diverse and disparate sources
  • Automate processes involving text and information extraction from construction-based PDF files
  • Extract unstructured data from different document types and index it for search within a few seconds
  • Generate reports from a visualization dashboard that provides charts for aggregated information