Dataset extraction from advertising marketplaces: experiences with Facebook, Olx, and Mercadolivre
Keywords: datasets, Python, e-commerce, marketplaces, advertising
Considering the growing volume of information produced in digital marketplaces, the increasing adoption of these services by internet users in Brazil and worldwide, and the lack of work on this topic, this research experiments with extracting datasets from advertisements. After analyzing the main e-commerce spaces in Brazil, the marketplaces chosen were Mercado Livre, Facebook, and OLX. The scripts were written in Python with the following libraries: Scrapy, Beautiful Soup, and Selenium WebDriver. After analyzing the web structure of the ad results pages, scripts were created to extract the main variables of each ad within a category common to the three marketplaces. The results show that scrapers can extract datasets from advertisements on these platforms in different formats. Such information has potential for exploration in various segments of data science.
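As a rough illustration of the extraction step summarized above, the sketch below parses a results page into one record per ad and serializes the records as CSV and JSON. It is not the authors' script. The study names Scrapy, Beautiful Soup, and Selenium WebDriver, whereas this sketch swaps in the standard library's `html.parser` so it runs without extra dependencies. The sample markup, the class names (`ad`, `title`, `price`, `city`), and the helper names are all hypothetical; real marketplace pages use different markup that changes over time.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical results-page markup, invented for this sketch.
SAMPLE_HTML = """
<ul>
  <li class="ad"><span class="title">Used bicycle</span><span class="price">R$ 450</span><span class="city">Curitiba</span></li>
  <li class="ad"><span class="title">Mountain bike</span><span class="price">R$ 1.200</span><span class="city">Recife</span></li>
</ul>
"""

# Assumed "main variables" of an ad; the study's actual variables may differ.
FIELDS = ("title", "price", "city")


class AdParser(HTMLParser):
    """Collects one dict per <li class="ad"> element."""

    def __init__(self):
        super().__init__()
        self.ads = []
        self._current = None  # the ad currently being built, if any
        self._field = None    # the field whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "li" and css_class == "ad":
            self._current = {}
        elif tag == "span" and self._current is not None and css_class in FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.ads.append(self._current)
            self._current = None


def parse_price(text):
    """Convert a Brazilian-format price such as 'R$ 1.200,50' to a float."""
    digits = text.replace("R$", "").strip().replace(".", "").replace(",", ".")
    return float(digits)


def extract_ads(html):
    """Return a list of ad records with the price converted to a number."""
    parser = AdParser()
    parser.feed(html)
    parser.close()
    for ad in parser.ads:
        ad["price"] = parse_price(ad["price"])
    return parser.ads


def to_csv(ads):
    """Serialize the ad records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ads)
    return buffer.getvalue()


ads = extract_ads(SAMPLE_HTML)
csv_text = to_csv(ads)                            # tabular format
json_text = json.dumps(ads, ensure_ascii=False)   # nested/document format
```

Keeping parsing (`extract_ads`) separate from serialization (`to_csv`, `json.dumps`) means the same records can be exported in more than one format. Only the parser would need to change for each marketplace's page structure.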
Copyright (c) 2022 Eduardo Diniz, Gustavo Medeiros de Araújo
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) which permits copying and redistributing the material in any medium or format, adapting, transforming and building upon the material as long as the license terms are followed.