In my research, I often combine in-depth interviews with computational methods. I consider both approaches complementary. On the one hand, data analysis can suggest unsuspected angles for interviews. On the other, a good interview can explain otherwise enigmatic patterns in the data.

Social Media and the Rise of the New Right in Brazil

My doctoral research was a unique opportunity to delve into themes that sit at the crossroads of social behavior, communication strategy, and technological innovation. It encompassed topics such as the business models made possible by digital platforms, the public impact of algorithmic recommendations, political realignment in the internet age, and the use of social media to challenge established institutions.

I analyzed YouTube and Twitter data and conducted sixty-five in-depth interviews with influencers, activists, religious leaders, businesspeople, journalists, and intellectuals associated with the new Right in Brazil.
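To give a sense of that collection effort, here is a minimal sketch of the kind of request my pipeline issued against the YouTube Data API v3. The channel ID and the fields retrieved are placeholders, not details from the study; the actual pipeline is in the repository linked below.

```python
# Minimal sketch: retrieving basic channel statistics from the YouTube Data API v3.
# Assumes the google-api-python-client package and an API key in YOUTUBE_API_KEY.
import os

from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])

response = youtube.channels().list(
    part="snippet,statistics",
    id="UCxxxxxxxxxxxxxxxxxxxxxx",  # placeholder channel ID, not one from the study
).execute()

for channel in response.get("items", []):
    stats = channel["statistics"]
    print(channel["snippet"]["title"], stats["subscriberCount"], stats["videoCount"])
```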

My doctoral dissertation can be downloaded here.

Methods: Network analysis, regression analysis, in-depth interviews, oral history, archival research.

Tools: Python, R, Jupyter, Gephi, Presto, AWS Cloud (Athena, EC2, S3, SQS, CloudWatch, Lambda, etc.), various transcription services (Azure, IBM Watson, Google Cloud, and AWS), Twitter API, YouTube API.

Source code (GitHub): data collection pipeline, automated transcription of interviews, network analysis, and regression analysis.

Citizens Exposed to Dissimilar Views in the Media: Investigating Backfire Effects

In 2019, I joined a research project that investigates backfire effects in media debates: that is, the extent to which exposure to a viewpoint different from our own may harden our opposition to that very viewpoint.

The project is based at the University of Amsterdam, funded by the European Union, and led by Professor Magdalena Wojcieszak, of the University of California, Davis.

It relies on a dataset that links longitudinal surveys on contentious issues with the browsing histories of the same respondents who answered those surveys. We are thus able to explore the relationship between digital media consumption and opinions held over time. The research design also permits international comparisons, with respondents located in three countries: the United States, Poland, and the Netherlands.
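As a rough illustration of what that design enables, the sketch below joins survey and browsing tables for the same respondents. The table and column names are hypothetical, invented for the example; they are not the project's actual schema.

```python
# Hypothetical example: count each respondent's visits to news sites during a
# survey wave, alongside the opinion they reported in that wave.
import psycopg2

conn = psycopg2.connect("dbname=media_study user=analyst")  # placeholder credentials

QUERY = """
SELECT s.respondent_id,
       s.wave,
       s.issue_position,
       COUNT(b.url) AS news_visits
FROM survey_responses AS s
JOIN browsing_history AS b
  ON b.respondent_id = s.respondent_id
 AND b.visited_at BETWEEN s.wave_start AND s.wave_end
WHERE b.domain_category = 'news'
GROUP BY s.respondent_id, s.wave, s.issue_position;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for respondent_id, wave, issue_position, news_visits in cur.fetchall():
        print(respondent_id, wave, issue_position, news_visits)
```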

I have been responsible for designing the database structure and guaranteeing that the data is secure, accessible, and queryable. A number of publications have already come out of this project, and I am a coauthor of a few others at various stages of peer review: this one, for example, has already been accepted by Political Communication, and another one is forthcoming in Humanities and Social Sciences Communications.

Methods: Longitudinal surveys, online surveys, browsing history analysis, natural language processing.

Tools: PostgreSQL, Presto, Athena, Python.

Source code (GitHub): Exporting data from PostgreSQL to Athena, project’s repository.

Tweeting About Tax Avoidance: How NGOs and Journalists Create Salience in a World Crowded with Good Causes

The goal of this study was to investigate the impact of recent offshore leaks (the Panama Papers, Paradise Papers, Lux Leaks, Swiss Leaks, Pandora Papers, and others) on the global debate about tax havens and tax fairness. Professor Anya Schiffrin, of Columbia's School of International and Public Affairs, and Shant Fabricatorian, a PhD candidate in my program, were my partners in this research project.

I was responsible for the computational and quantitative analysis of the data. I built a comprehensive corpus of the online articles written in English about this topic, scraped the content of each, and, with the help of natural language processing techniques, identified the most common keywords and concepts in the corpus. I also collected each article's social media engagement in order to determine its relative influence.
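The keyword and concept extraction relied on IBM Watson's natural language services. As a rough stand-in for that step, the sketch below ranks corpus terms with TF-IDF in scikit-learn; the article snippets are placeholders, not texts from the actual corpus.

```python
# Illustrative substitute for the keyword-extraction step: rank the terms
# of a (toy) corpus by their average TF-IDF weight.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "The Panama Papers leak exposed a global network of offshore shell companies.",
    "Activists demanded tax fairness after the Paradise Papers revelations.",
    "Journalists traced hidden accounts named in the Swiss Leaks files.",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(articles)

# Average each term's weight across documents and print the top ten terms.
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names_out())
print(terms[mean_weights.argsort()[::-1][:10]])
```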

Professor Schiffrin and Shant conducted in-depth interviews with tax justice activists and journalists who have contributed to the public outcry against tax havens. My research partners wanted to understand how those activists and journalists assessed the role of the scandals in shaping public debates about tax policy over the past ten years.

Shant and I presented the results of our research at the annual conference of the International Communication Association in 2019. The study is going to be published later this year as a chapter in a Routledge companion on business journalism.

Methods: Content analysis, natural language processing (keyword and concept extraction, entity recognition), in-depth interviews.

Tools: Python, IBM Watson.

Exploring Supervised and Unsupervised AI Techniques for Text Classification

During the summer of 2015, I worked as the Google News Lab Fellow at the Pew Research Center in Washington, DC.

The center was making an effort to add machine learning and computational methods to the toolkit of its researchers. That interest was evident in recent hires. Solomon Messing—a seasoned researcher from Facebook’s core data science team—had just been invited to found and head Pew’s Data Labs. The journalism and media team had also brought in a computational social scientist, Galen Stocking.

I applied supervised and unsupervised methods to classify an immense corpus of digital news articles drawn from a wealth of tracking data provided by a major content analytics company.
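To make the two families of techniques concrete, here is a toy sketch of each: unsupervised topic discovery with LDA and supervised classification with a support-vector machine. It uses scikit-learn and placeholder documents; it illustrates the approach rather than reproducing the code from the fellowship.

```python
# Toy sketch of the two approaches: LDA topic modeling (unsupervised)
# and SVM text classification (supervised), both via scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "senate passes budget bill after long debate",
    "team wins championship game in overtime",
    "new tax bill divides lawmakers",
    "star player traded before the playoffs",
]
labels = ["politics", "sports", "politics", "sports"]

# Unsupervised: discover latent topics from raw word counts.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Supervised: learn categories from labeled examples, then classify new text.
tfidf = TfidfVectorizer(stop_words="english")
clf = LinearSVC().fit(tfidf.fit_transform(docs), labels)
print(clf.predict(tfidf.transform(["playoff game goes to overtime"])))
```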

Sadly, we later concluded that the tracking data was not suited to answering the questions about news consumption patterns that our team was posing. As a result, we did not move forward with that dataset, and my classifiers went unused.

However, I had the opportunity to share with my colleagues the techniques and code that I had employed to identify topics and categories in that corpus. I also presented a brief overview of supervised and unsupervised methods for textual analysis to researchers from all departments at Pew to illustrate the possibilities offered by these tools.

The following year, a group of colleagues from the media team published a report on news consumption on Reddit. They used a supervised method very similar to the one that I had applied to our earlier content analytics corpus. (To be clear: I was not directly involved in that study. I mention it only to show that the initial attempts to adopt machine learning, of which I was a part, began bearing fruit a few months later.)

Methods: LDA-based topic modeling, support-vector machines.

Tools: Google BigQuery, scikit-learn, Mallet.

Flash Mobs of Low-Income Youth in Brazilian Malls: A Study on Media Frames

My master’s research focused on a phenomenon that puzzled Brazilian news analysts and sociologists in 2013 and 2014: hundreds, sometimes thousands, of low-income youth would get together in shopping malls. Those gatherings were called rolezinhos, a Portuguese slang word whose literal meaning is little stroll. The rolezinhos were coordinated on social media, especially Facebook. They were not intended as protests; the young participants just wanted to chat, laugh, flirt, and have fun in a crowded environment. Yet most malls resented the disruption the gatherings caused for customers and shop owners. The police were often called to break up the party and, on a few occasions, resorted to violence.

A passionate debate ensued on newspaper pages, the airwaves, and social media feeds. Some commentators depicted the participants in rolezinhos as scoundrels who disturbed the peace. Others criticized shopping malls and shop owners for their prejudiced attitude toward poor people. There were also those who celebrated rolezinhos as a sign that capitalism had triumphed and that low-income Brazilians, like everyone else, just aspired to a consumerist extravaganza.

In my research, I identified those media frames and quantified their relative importance in the public debate. I used Media Cloud—a tool developed by the Berkman Center for Internet and Society, at Harvard—to create a corpus of the online articles on this topic. At that time, Media Cloud was not adapted to the Brazilian media ecosystem. I had to add a collection of the most important media sources in Brazil. In doing so, I created a road map for other foreign researchers who wanted to use Media Cloud to study media environments where English was not the dominant language.

I used a semi-automated method to identify the media frames in my corpus: I first manually coded the texts and then performed a hierarchical clustering analysis. For each text in the corpus, I also retrieved the corresponding Facebook engagement; social media figures thus became my proxy for the relevance of a commentary or news piece on this topic. Finally, I interviewed the most influential content producers to understand their media strategies.
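As a minimal illustration of that clustering step, the sketch below groups texts by the similarity of their manual codes using SciPy's hierarchical clustering. The binary code matrix is a toy example, not data from the thesis.

```python
# Toy sketch: cluster texts by the manual codes assigned to them, so that
# texts sharing similar codes fall into the same candidate frame.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Rows are texts; columns mark the presence (1) or absence (0) of a code.
codes = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
])

Z = linkage(codes, method="ward")                # build the cluster hierarchy
frames = fcluster(Z, t=2, criterion="maxclust")  # cut it into two clusters
print(frames)  # candidate frame label for each text
```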

My master’s thesis can be downloaded here.

Methods: Hierarchical clustering, semi-automated content analysis, qualitative coding, in-depth interviews.

Tools: R, Python, Facebook API.

Source code (GitHub): Web scraping, API communication, database management, statistical analysis.

Memories From the East: Migrants Who Sought Shelter in Brazil in the Wake of Nazism and Stalinism

In contrast with my later research projects, my undergraduate thesis did not involve any computational methods. It was an oral history of people who migrated to Brazil from Eastern Europe, fleeing misery and persecution under the Nazi occupation or behind the Iron Curtain.

I conducted a dozen in-depth interviews with immigrants who moved to Brazil in the mid-twentieth century. They came from Russia, Croatia, Hungary, Lithuania, Romania, Slovenia, and Poland. My undergraduate thesis is available in Portuguese.

Methods: Oral history, in-depth interviews, archival research.