Implementation of Web Scraping on Google Search Engine for Text Collection Into Structured 2D List

Tresna Maulana Fahrudin; Prismahardi Aji Riyantoko; Kartika Maulida Hindrayani

doi:10.31315/telematika.v20i2.9575

Authors

Tresna Maulana Fahrudin, Department of Data Science, Faculty of Computer Science, UPN "Veteran" Jawa Timur, http://orcid.org/0000-0002-9895-2442
Prismahardi Aji Riyantoko, Department of Data Science, Faculty of Computer Science, UPN "Veteran" Jawa Timur
Kartika Maulida Hindrayani, Department of Data Science, Faculty of Computer Science, UPN "Veteran" Jawa Timur

Keywords:

web scraping, google search engine, text collection, structure of 2D list, parsing HTML

Abstract

Purpose: This research proposes an implementation of web scraping on the Google Search Engine that collects text into a structured 2D list.

Design/methodology/approach: Data collection through web scraping is carried out in two stages: HTML parsing to extract the links (URLs) from Google Search Engine result pages, followed by HTML parsing to extract the body text from the web page behind each collected link.
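The two parsing stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only Python's standard `html.parser` module, it parses two small hypothetical HTML strings (`RESULTS_PAGE` and `ARTICLE_PAGE`) instead of downloading real pages, and the helper names `extract_links` and `extract_body_text` are assumptions introduced here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Stage 1: collect absolute href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Keep absolute links only, and skip duplicates.
            if href and href.startswith("http") and href not in self.links:
                self.links.append(href)


class BodyTextExtractor(HTMLParser):
    """Stage 2: collect visible text found inside <body>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_links(html):
    """Return the absolute links found in an HTML string (hypothetical helper)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_body_text(html):
    """Return the visible body text of an HTML string (hypothetical helper)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical stand-ins for a downloaded result page and a target page.
RESULTS_PAGE = """
<html><body>
  <a href="https://example.com/news/1">First result</a>
  <a href="/preferences">Settings</a>
  <a href="https://example.org/article">Second result</a>
</body></html>
"""
ARTICLE_PAGE = """
<html><head><title>Ignored</title><style>p {color: red}</style></head>
<body><h1>Headline</h1><p>Body text of the article.</p>
<script>var x = 1;</script></body></html>
"""

links = extract_links(RESULTS_PAGE)
text = extract_body_text(ARTICLE_PAGE)
print(links)  # ['https://example.com/news/1', 'https://example.org/article']
print(text)   # Headline Body text of the article.
```

In a real run, the HTML strings would come from HTTP responses, and the second stage would be repeated for every link returned by the first.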

Findings/result: The input queries were chosen to reflect recent issues and news in Indonesia, for example important figures such as the President, the month of Ramadan and Idul Fitri, the stadium riot tragedy and natural disasters, rising prices of basic commodities, oil, and gold, as well as other news. The number of links obtained per query ranged from 56 to 151, and the processing time needed to obtain the links for a query ranged from 1 minute 6.3 seconds to 2 minutes 49.1 seconds. The links scraped for these queries came from Wikipedia, Detik, Kompas, the Election Supervisory Body (Bawaslu), CNN Indonesia, the General Election Commission (KPU), Pikiran Rakyat, and others.

Originality/value/state of the art: Building on previous research, this study offers an alternative way to produce an optimal collection of links and text from web scraping results in the form of a 2D list structure. Lists in the Python programming language can store character sequences as strings, can be accessed by index, and allow text to be manipulated efficiently.
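A 2D list of this kind might look like the sketch below. The row layout (one row per page, with the link in column 0 and the body text in column 1) is an assumption made for illustration, and the URLs and texts are placeholders rather than data from the study.

```python
# Hypothetical scraped results: each row is [link, body_text].
scraped = [
    ["https://example.com/news/1", "Fuel prices rise again this week."],
    ["https://example.org/article", "Gold prices reach a new high."],
]

# Index access: the first index selects the page, the second selects
# the link (0) or the text (1).
first_link = scraped[0][0]
second_text = scraped[1][1]

# Because the stored values are strings, the usual string methods apply.
word_counts = [len(row[1].split()) for row in scraped]
lowercased = [[link, body.lower()] for link, body in scraped]

print(first_link)   # https://example.com/news/1
print(second_text)  # Gold prices reach a new high.
print(word_counts)  # [6, 6]
```

Keeping the link and its text in the same row means a single index identifies both the source and the content collected from it.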
