Multi-threaded/-processed Requests to Cloud Services for Intelligent Address Standardization
Software productivity and sustainability are two key concerns when developing or modifying scientific software. Here, we present a case study on incorporating multithreading and multiprocessing into an extension of a software package to speed up large-scale address standardization. For long-term sustainability, we have developed the research software with substantial reuse of existing code, Google-style Python documentation, and doctests.
Address standardization serves as an important preprocessing step to geocoding or a post-processing step to reverse geocoding. It is critical in various record linkage schemes for big data sources involving geographical fields. Several well-known address standardization tools are available as cloud services, for example, usaddress, Data Science Toolkit, and Geocoder.us; their underlying models are based on intelligent machine learning methods such as neural networks trained on large samples of addresses in the US and beyond.
In this case study, we extend an open-source Python software package by Choi, Lin, and Mulrow (2017) that serially tested the accuracy and response time of the aforementioned cloud services on parsing large samples of clean and noisy addresses. Specifically, we design and implement multithreading and multiprocessing wrappers for issuing RESTful API requests in parallel. Our parallelized testing approach achieves an average speedup factor of more than 15 in execution time on machines with multiple cores.
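
As a minimal sketch of the kind of wrapper described above, and not the package's actual code, the following Python example issues per-address requests through a worker pool from the standard library's concurrent.futures module. The names fake_parse and parse_all are hypothetical, and fake_parse only simulates network latency with a short sleep instead of calling any real service.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_parse(address):
    """Hypothetical stand-in for one RESTful request to a parsing service.

    A real wrapper would issue an HTTP call here; this stub only sleeps
    briefly and splits the house number from the rest of the string.
    """
    time.sleep(0.01)  # simulated network latency (assumption)
    number, _, street = address.partition(" ")
    return {"number": number, "street": street}


def parse_all(addresses, parse=fake_parse, workers=8, use_processes=False):
    """Apply `parse` to every address in parallel, preserving input order.

    Threads suit I/O-bound requests; processes may help when the
    per-address work is CPU-bound. With processes, `parse` must be a
    picklable, module-level function.
    """
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(parse, addresses))
```

Calling parse_all(["1 Main St", "22 Oak Ave"], workers=2) returns one parsed record per address in input order. Any speedup depends on request latency, the worker count, and rate limits imposed by the service, so this sketch says nothing about the factor reported above.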