Geosemantic Surveillance and Profiling of Abduction Locations and Risk Hotspots Using Print Media Reports

Toyib Ogunremi, Olubayo Adekanmbi, Anthony Soronnadi & David Akanji
Data Scientists Network Foundation
Toyib@datasciencenigeria.ai, Olubayo@datasciencenigeria.ai, Anthony@datasciencenigeria.ai & David@datasciencenigeria.ai

Abstract

Kidnapping poses a significant social risk in Nigeria, often exacerbated by the lack of local crime data, underreporting of cases, and the potential involvement of security operatives. Our research aims to combat this menace by developing a data-driven solution that offers comprehensive insights into crime locations and entities. We generated a reliable dataset by geoparsing newspaper-reported crime locations and entities using Natural Language Processing (NLP) techniques and the Google geocoder. Our method involves designing an algorithm that can geoparse locations in unstructured raw text. We then applied clustering and geospatial analysis to identify social risk hotspots. The results provide crucial insights for addressing the threat of kidnapping in Nigeria. We recommend our data-driven approach as an intervention strategy to aid law enforcement and policy makers. Our study contributes to the understanding of the spatio-temporal dynamics of kidnapping cases in Nigeria.

Introduction

Geo-semantics is an interdisciplinary field that combines Natural Language Processing (NLP), which enables computers to understand human language, and geospatial techniques, which analyze data with respect to space. Geoparsing is the process of converting unstructured text into structured geospatial information. It involves toponym recognition, which uses NER techniques from NLP, and toponym resolution, which uses geocoding tools to resolve the extracted locations to positions in global space (Hu et al., 2022).
This study aims to apply Named Entity Recognition (NER) and geospatial analysis to generate a detailed dataset of crime incidents in Nigeria, contributing to crime analysis and intervention initiatives in crime-plagued countries.

©Machine Intelligence Research Group, University of Lagos
V. Odumuyiwa et al. (Eds.): MIRG-ICAIR 2023, pp. 77–82, 2023.

Related works

In 2010, a morphosyntactic parser was proposed to extract crime information from Portuguese text as input to WikiCrimes, a web-based platform for recording aggregated crime information (Pinheiro et al., 2010). Asharef et al. (2012) proposed a rule-based Arabic NER approach for crime-related entity extraction using morphological information, predefined and general crime indicator lists, and an Arabic named entity corpus (Asharef, Omar & Albared, 2012). Arulanandam et al. (2014) used Conditional Random Fields (CRF) to classify theft-related sentences and extract crime locations from news articles in New Zealand (Arulanandam, Savarimuthu & Purvis, 2014). Shabat and Omar (2015) employed classification algorithms for crime NER and crime type identification tasks (Al-Shoukry & Omar, 2015). Carnaz et al. (2021) trained an NER module to identify entities such as person, organization, location, and date in Portuguese police reports and online news about crime (Carnaz, Antunes & Nogueira, 2021). Habib et al. (2020) extracted 900 crime records from an 8-year news archive of Pakistan using NLP and performed hotspot-based spatial analysis to predict the behavior of criminal networks using two classifiers, K-Nearest Neighbor (KNN) and Random Forest (Habib et
al., 2020). While these studies have contributed significantly to crime analysis and mitigation around the world, there is a need for replication in Nigeria, because language use varies from place to place. This study therefore uses a reliable and reputable print media source to lay the groundwork, in the Nigerian context, for future investigation into automatic crime surveillance techniques for better and more efficient crime interventions.

Methodology

This study involved natural language processing and geospatial exploratory data analysis (EDA) of ten years' worth of kidnapping news articles from the Punch newspaper website, one of the top daily news outlets in Nigeria, as described below.

3.1 Data Acquisition

Data were collected from the Punch online website. Web scraping techniques were used to extract the HyperText Markup Language (HTML) content of the Punch newspaper webpages. We extracted kidnap news reports using the 'requests' library, a widely used Python package for sending HTTP requests, and the 'BeautifulSoup' library for parsing the returned HTML structure. The extracted information, including the publishing date, news headline, and full content, was stored in a pandas dataframe for analysis (Oyelere, 2023).

3.2 Filtering

The selection process involved filtering out articles that were not directly related to kidnap incidents, such as those covering rescue operations or general discussions of kidnappings. To achieve this, we used the spaCy 'en_core_web_trf' transformer model to perform Part of Speech (POS) tagging and dependency parsing on all retrieved headlines. This allowed us to identify the root verb of each headline and lemmatize it to remove inflections. We then checked whether the resulting lemma was "kidnap" or one of its synonyms, such as "abduct" and "whisk away".

a. Data Preprocessing: Entity Recognition and Information Retrieval, to identify crime entities mentioned in news articles.
First, we segmented each article into sentences and analyzed the syntactic relationships between the elements of each sentence. To identify the victims, we took the words in a direct object relationship with the verb "kidnap" or its synonyms in each sentence. We also extracted all the elements in the subtree of that object to account for compound objects. The same approach was used to extract the kidnappers, using the nominal subject relationship with the verb "kidnap" or its synonyms.

b. Geoparsing

i. Toponym recognition: After segmenting the sentences and filtering for those related to kidnappings, we extracted all locational phrases introduced by a locational preposition. Restricting extraction to kidnapping-related sentences was necessary to avoid mistakenly including locations mentioned in unrelated sentences. We then removed all stop words from every recognized entity and kept only capitalized tokens that met the criteria for a proper noun. Finally, we combined the resulting location fragments into a single string.

ii. Toponym resolution: We used the Google geocoder through the geopy library to geocode the identified locations. This worked well for our data: the geocoder returned geographic coordinates for the identified locations even when their components were not properly ordered, and it also resolved street-level locations in obscure places and misspelt location names.

3.3 Exploratory Data Analysis

The publishing date of each article was assumed to be the date of the kidnapping event, because most news articles are published online within a few hours of the event. The generated dataframe was then converted into a geodataframe to perform geospatial analysis.
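
As an illustration of the toponym recognition step (b.i), the following is a minimal sketch and not the authors' implementation. It substitutes a regular-expression tokenizer and small hand-written word lists for the spaCy pipeline used in the study; the preposition list, the stop-word list, and the function name `extract_toponym` are assumptions made for this example.

```python
import re

# Assumed word lists for illustration only; the study does not enumerate them.
LOCATIONAL_PREPOSITIONS = {"in", "at", "along", "near", "around", "from"}
STOP_WORDS = {"the", "a", "an", "of", "and"}


def extract_toponym(sentence: str) -> str:
    """Collect capitalized tokens that follow a locational preposition.

    Tokens before the first locational preposition are ignored, stop words
    are dropped, and the remaining capitalized tokens are joined into one
    string that could be handed to a geocoder.
    """
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
    parts = []
    collecting = False
    for token in tokens:
        lowered = token.lower()
        if lowered in LOCATIONAL_PREPOSITIONS:
            collecting = True  # everything after this may be a location
            continue
        if not collecting or lowered in STOP_WORDS:
            continue
        if token[0].isupper():  # crude proxy for a proper noun
            parts.append(token)
    return " ".join(parts)


location = extract_toponym("Gunmen kidnap nine in Abuja estate")
```

The resulting string would then be passed to a geocoder for toponym resolution. Unlike a dependency parse, this simplification treats every occurrence of a listed preposition as locational, so it will over-extract on sentences where "in" or "from" is used in a non-spatial sense.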
The Nigerian state and local government administrative boundaries were spatially joined with the kidnap geodataframe using the 'within' operation of the geopandas 'sjoin' method, to capture the administrative boundary containing each crime location. Figure 1 shows a diagram of the proposed workflow. We then performed a hotspot analysis of the aggregated reported locations.

Figure 1: Diagram showing the proposed workflow. [Stages shown in the diagram: Kidnap (report) News; Data Acquisition (online newspaper scraping); Data Classification (filtering out rescue operation reports, public commentary, and editorials on kidnapping); Data Preprocessing (NER), comprising Information Retrieval (victims, kidnapper) and Geoparsing (1. toponym recognition, 2. toponym resolution); Exploratory Data Analysis (cluster of risk hotspots, crime entities aggregation, visualization of resolved locations).]

Results and Discussion

Our study generated a dataset of crime entities and locations from over 1200 reports, which can serve as a starting point for researchers exploring this area of national security. A sample of the generated dataset is shown in Table 1; the dataset is currently being open-sourced to the NLP community on the Hugging Face platform. We also analyzed the spatio-temporal distribution of kidnapping cases in Nigeria. The clustering of crime locations, shown in the choropleth map in Figure 2, reveals Kaduna State as the most affected region.
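
The 'within' spatial join and the per-region aggregation behind a choropleth map can be illustrated without geopandas. The sketch below is a simplified stand-in, assuming planar coordinates and toy rectangular regions rather than real Nigerian administrative boundaries; the function names `point_in_polygon` and `count_by_region` are hypothetical.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if the point lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def count_by_region(points, regions):
    """Assign each point to the first region containing it and count per region."""
    counts = Counter()
    for lon, lat in points:
        for name, polygon in regions.items():
            if point_in_polygon(lon, lat, polygon):
                counts[name] += 1
                break
    return counts


# Toy data: two adjacent rectangles and four geocoded incident points.
regions = {
    "Region A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Region B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 5.0)]
counts = count_by_region(points, regions)
```

Points that fall outside every region, such as the last one here, are simply left uncounted, which mirrors how an inner spatial join drops unmatched rows. The per-region counts are the values a choropleth map would shade.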
We observed that Kaduna's high kidnapping rates could be attributed to factors such as overpopulation and high cost of living, as noted in other high-risk states like Lagos, Rivers, and the Federal Capital Territory (FCT) (Ikenwa, 2023). Of the six geopolitical zones in Nigeria, the North West and North Central zones face an increasing rate of kidnapping occurrences, as shown by our analysis in Figure 3. We attribute this to the multi-dimensional crisis facing the two regions, which has resulted in widespread displacement of citizens across both (International Organization for Migration (IOM), 2023).

Table 1: A sample of the generated dataset.

| publish_date | headline | victims | kidnapper | address |
|---|---|---|---|---|
| 5/9/2023 | Bandits kidnap 40 Kaduna worshippers, nine emir’s children | 40 Kaduna worshippers, nine emir ’s children | Bandits | Chikun, Kaduna, Nigeria |
| 4/7/2023 | Gunmen kidnap Nasarawa ex-deputy governor, Gye-Wado | Nasarawa ex-deputy governor, Gye-Wado | Gunmen | 960134, Wamba, Nasarawa, Nigeria |
| 3/12/2023 | Gunmen kidnap nine in Abuja estate | nine | Gunmen | Grow Home Estate, Bwari Area council, 901101, FCT, Nigeria |

Figure 2: Choropleth analysis of aggregated risk locations by state.

Significance and Limitations

Our primary objective is to address the issue of underreported cases by automating the documentation process. By doing so, we hope to promptly notify security personnel about ongoing incidents through a real-time dashboard, shown in Figure 4, in contrast to the traditional logging of statements by victims' relatives at security stations, which is neither timely nor efficient. This removes the need for victims' relatives to give statements at the security stations. We hope this will enable rapid deployment of emergency responses to the reported locations and support strategic security operations. However, recent changes in Twitter's data scraping policy have posed challenges.
Unfortunately, we are unable to directly access the wealth of data from the public, even though Twitter is a significant source of reports on this social risk; we therefore resorted to using reports from newspaper outlets. Future work will focus on improving feature extraction from newspaper articles beyond syntactic entities to the circumstances of crime events, and on covering other crime types such as theft, rape, riot, and murder. We also plan to extend this surveillance technique to other major African languages such as Hausa, Igbo, and Yoruba, and eventually to augment it with a speech-to-text pipeline to handle crimes reported by audio.

Figure 3: Yearly kidnapping rate proportion of the six Nigerian geopolitical zones. [Chart titled "Yearly Kidnapping rate distribution by region"; x-axis: Year; y-axis: % of Cases, 0% to 100%.]

Figure 4: Dashboard tracking real-time reported kidnapping cases.

List of References

Al-Shoukry, S., & Omar, N. (2015). Arabic Named Entity Recognition for Crime Documents Using Classifiers Combination. International Review on Computers and Software, 10(6). https://doi.org/10.15866/irecos.v10i6.6767

Arulanandam, R., Savarimuthu, B. T. R., & Purvis, M. (2014). Extracting crime information from online newspaper articles.

Asharef, M., Omar, N., & Albared, M. (2012). Arabic named entity recognition in crime documents. Journal of Theoretical and Applied Information Technology, 44(1), 1–6.

Carnaz, G., Antunes, M., & Nogueira, V. B. (2021). An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing. Data, 6(7), 71. https://doi.org/10.3390/data6070071

Habib, U., Areeba, U., Sarfraz, S., Ahmad, M., Ullah, M., & Mazzara, M. (2020). Spatiotemporal Analysis of Web News Archives for Crime Prediction. Applied Sciences, 10. https://doi.org/10.3390/app10228220

Hu, X., Zhou, Z., Li, H., Hu, Y., Gu, F., Kersten, J., Fan, H., & Klan, F. (2022).
Location reference recognition from texts: A survey and comparison. 1, 1 (July 2022), 35 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Ikenwa, C. (2023). 10 Most Expensive Cities in Nigeria to Live in (Cost of Living 2023). Nigerian Infopedia. Retrieved from https://nigerianinfopedia.com/10-expensive-cities-in-nigeria-to-live-in/

International Organization for Migration (IOM) (2023). DTM Nigeria – North-central & North-west Flash Report 148 (05 – 11 June 2023). IOM, Nigeria.

Oyelere, J. (2023). Scraping News Articles on Kidnapping using Python & BeautifulSoup. Medium Blog Post. Retrieved from https://medium.com/@joyelere/scraping-news-articles-on-kidnapping-using-python-beautifulsoup-e5970adec709

Pinheiro, V., Furtado, V., Pequeno, T., & Nogueira, D. (2010). Natural language processing based on semantic inferentialism for extracting crime information from text. 2010 IEEE International Conference on Intelligence and Security Informatics. https://doi.org/10.1109/isi.2010.5484783