Python Data Collection and Management for Public Policy Research

Instructor Name

Blake Miller

Instructor Biography

Blake Miller is an Assistant Professor of Computational Social Science in the Methodology Department at the London School of Economics and Political Science. He received his PhD in Political Science and Scientific Computing from the University of Michigan in 2018, where he was also a graduate research affiliate in the Lieberthal-Rogel Center for Chinese Studies. Before coming to LSE, he was a Post-Doctoral Fellow at the Dartmouth College Program in Quantitative Social Science. Blake has also spent several years in Silicon Valley as an executive for tech start-up companies. For more information, please visit

Course Description

The massive amount of data available online continues to expand the bounds of social scientific inquiry. Researchers in both academia and the private sector can gain a greater understanding of human behavior by analyzing the abundant social data stored online. To make use of these data, one must first master the technical skills needed to gather and process them, which can be quite challenging to do properly.

The main goal of this course is to provide students with the tools needed to construct, process, and clean data found online. After taking this course, students will be able to build datasets out of unstructured, semi-structured, and structured online data.

Course Schedule

Ten Session Topics

  • Introduction to ‘big data,’ data ethics, and Python.

  • Reproducibility, Git, and GitHub.

  • Basic Python data structures and coding.

  • More basic Python, introduction to Pandas. How to store and access data in different formats.

  • Introduction to HTML, parsing HTML: How to use basic HTML and CSS to extract information automatically from websites.

  • Introduction to web scraping. Write a program to dynamically crawl a website and gather relevant data.

  • Introduction to APIs. How to incorporate social network and geolocation data (e.g. from Twitter, Weibo, Google, or Baidu) in one’s data.

  • Basic SQL: using relational databases and accessing them via Python.

  • Dealing with dirty data, fuzzy string matching, regular expressions.

  • Putting it all together, discussing applications, roadmap to future learning.
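As a rough preview of how several of these sessions fit together, the sketch below parses a small HTML snippet, cleans the extracted text with a regular expression, stores it in a relational database, and fuzzy-matches a query against the stored names. It is a minimal illustration, not course material: the HTML snippet, the `agency` class and table name, and the helper functions are all hypothetical, and only the Python standard library is used (the course itself may use other packages).

```python
import difflib
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="agency">  Dept. of   Transportation </li>
  <li class="agency">Department of Education</li>
  <li class="other">Not an agency</li>
</ul>
"""


class AgencyParser(HTMLParser):
    """Collect the text of every <li class="agency"> element."""

    def __init__(self):
        super().__init__()
        self.in_agency = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "agency":
            self.in_agency = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_agency = False

    def handle_data(self, data):
        if self.in_agency:
            self.names[-1] += data


def clean(text):
    """Dirty-data step: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def scrape_names(html):
    """HTML-parsing step: extract and clean the agency names."""
    parser = AgencyParser()
    parser.feed(html)
    return [clean(name) for name in parser.names]


def store_names(names):
    """SQL step: load the names into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE agency (name TEXT)")
    conn.executemany("INSERT INTO agency (name) VALUES (?)", [(n,) for n in names])
    conn.commit()
    return conn


def best_match(query, candidates, cutoff=0.6):
    """Fuzzy-matching step: return the closest candidate, or None."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


names = scrape_names(SAMPLE_HTML)
conn = store_names(names)
stored = [row[0] for row in conn.execute("SELECT name FROM agency ORDER BY name")]
print(best_match("Department of Transportation", stored))
```

The in-memory database and the hard-coded snippet keep the example self-contained; a real project would fetch pages over the network and write to a database file instead.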

Learning Outcomes

The aims of this course are:

  • to introduce students to important concepts and methodologies related to the management, collection, processing, and cleaning of data for social science and public policy research;

  • to teach students the practical concerns and best practices of data management and data collection;

  • to build foundational skills necessary to construct useful datasets for their research from unstructured, semi-structured, and secondary data;