A Structured Data Wrangling Pipeline for TikTok Datasets Using Pandas Python

Agus Suharto, Muhammad Syarif Hartawan

Abstract


This study develops a structured data wrangling pipeline for TikTok datasets using the Pandas library for Python. The purpose of the research is to transform raw social media data into clean, consistent, and analyzable formats that can support academic inquiry into digital engagement patterns. The methodology consists of five stages: data loading, cleansing, transformation, feature engineering, and validation. Raw TikTok data, including video metadata, user interactions (likes, comments, shares), and hashtags, were processed to remove inconsistencies, handle missing values, and standardize formats. Feature engineering was applied to derive analytical variables such as engagement rate, posting frequency, and hashtag clusters. Validation checked the structural integrity, completeness, and consistency of the dataset so that it could support reliable statistical analysis. The results demonstrate that systematic wrangling improves dataset quality, enhances interpretability, and enables advanced analysis of user behavior and content trends. By applying Pandas-based operations, the study provides a reproducible framework that combines technical rigor with methodological transparency. This research contributes to the academic field of social media analytics by offering a practical pipeline for TikTok data preparation. It highlights the importance of data wrangling not merely as a preparatory step, but as a methodological foundation for evidence-based digital research.
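The five stages named in the abstract can be illustrated with a minimal Pandas sketch. It is not the authors' implementation: the sample records, the column names (`video_id`, `author`, `posted_at`, `views`, `likes`, `comments`, `shares`, `hashtags`), and the function names are all hypothetical. Engagement rate is assumed here to be interactions divided by views, and posting frequency is simplified to a per-author post count; the abstract does not define either variable. Hashtag clustering is not attempted; hashtags are only normalized into lists.

```python
import pandas as pd

# Hypothetical raw records; the schema is an assumption, not the study's data.
RAW = [
    {"video_id": "v1", "author": "alice", "posted_at": "2025-01-01 10:00",
     "views": 1000, "likes": 100, "comments": 10, "shares": 5, "hashtags": "#Dance #fyp"},
    {"video_id": "v1", "author": "alice", "posted_at": "2025-01-01 10:00",
     "views": 1000, "likes": 100, "comments": 10, "shares": 5, "hashtags": "#Dance #fyp"},
    {"video_id": "v2", "author": "alice", "posted_at": "2025-01-03 12:00",
     "views": 2000, "likes": None, "comments": 20, "shares": 0, "hashtags": "#FYP"},
    {"video_id": "v3", "author": "bob", "posted_at": "not a date",
     "views": 500, "likes": 5, "comments": 1, "shares": 0, "hashtags": "#cats"},
    {"video_id": "v4", "author": "bob", "posted_at": "2025-01-02 08:30",
     "views": 0, "likes": 0, "comments": 0, "shares": 0, "hashtags": None},
]
COUNTS = ["views", "likes", "comments", "shares"]


def load(records):
    """Stage 1: load raw records into a DataFrame."""
    return pd.DataFrame(records)


def cleanse(df):
    """Stage 2: drop duplicate videos, unparseable dates, and missing counts."""
    df = df.drop_duplicates(subset="video_id").copy()
    df["posted_at"] = pd.to_datetime(df["posted_at"], errors="coerce")
    df = df.dropna(subset=["posted_at"]).copy()
    # Assumption: a missing interaction count is treated as zero.
    df[COUNTS] = df[COUNTS].apply(pd.to_numeric, errors="coerce").fillna(0).astype("int64")
    return df


def transform(df):
    """Stage 3: standardize hashtags into lowercase lists without the '#'."""
    df = df.copy()
    df["hashtags"] = df["hashtags"].fillna("").str.lower().str.findall(r"#(\w+)")
    return df


def engineer(df):
    """Stage 4: derive engagement rate and a simple posting-frequency proxy."""
    df = df.copy()
    interactions = df[["likes", "comments", "shares"]].sum(axis=1)
    # Left as NaN when views == 0, since the ratio is undefined there.
    df["engagement_rate"] = interactions / df["views"].where(df["views"] > 0)
    df["posts_by_author"] = df.groupby("author")["video_id"].transform("count")
    return df


def validate(df):
    """Stage 5: check uniqueness, completeness, and value ranges."""
    assert df["video_id"].is_unique, "duplicate video_id"
    assert df["posted_at"].notna().all(), "missing timestamp"
    assert (df[COUNTS] >= 0).all().all(), "negative count"
    return df


clean = validate(engineer(transform(cleanse(load(RAW)))))
```

Keeping each stage as a function that takes and returns a DataFrame lets the stages be tested separately and rerun in order, which is one way to obtain the reproducibility the abstract emphasizes.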


Keywords


Pandas, Data Wrangling, Social Media, TikTok, Digital Research

DOI: https://doi.org/10.55311/aiocsit.v6i1.397
