[WIKI Prism] Facebook's 'Social Science One': A Private CIA to Analyze Users' Information?

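Facebook recently announced that it would establish an independent research commission to examine issues of social and political significance using the vast data on its platform, and that it would work with academic researchers to do so.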

The commission, which has now come out from behind the veil, is called Social Science One, and its first project will have researchers analyze a petabyte-scale trove of sharing data.

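The basic system of Social Science One is to give a group of researchers access to Facebook's processes and data sets. Drawing on their own experience as researchers, they will identify and design interesting data sets, then document them publicly. For example, a data set (hypothetical for now) might consist of 10 million status updates posted during the week of the Brexit vote, together with various metadata.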

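The documentation describing a data set doubles as a request for proposals from the research community. Other researchers interested in the data set propose analyses or experiments, which the commission evaluates. These proposals are peer-reviewed with help from the Social Science Research Council. A proposal judged to have merit may be awarded funding, access to the data and other benefits. The resulting research can be published however the researchers wish, with no restrictions such as prior approval by Facebook or the commission.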

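Gary King, a Harvard professor and co-founder of Social Science One, announced the plan in a blog post. “The data collected by private companies has vast potential to help social scientists understand and solve society’s greatest challenges. But until now that data has typically been unavailable for academic research,” he said. “Social Science One has established an ethical structure for marshaling privacy preserving industry data for the greater social good while ensuring full academic publishing freedom.”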

Facebook and the foundations funding the project selected King and Nate Persily as co-chairs, and the other scholars on the commission were selected afterward.

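Their first data set, described as an intriguing one, covers almost all public links shared and clicked by Facebook users worldwide, along with a wealth of useful metadata. It is said to contain about 2 million unique links shared in 300 million posts per week. It is estimated to hold 30 billion rows of data, or roughly a petabyte.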

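The metadata includes users' country, age and device, as well as ideological affiliation, the proportion of friends versus non-friends who viewed a post, feed position, and the total numbers of shares, clicks, likes, hearts and flags. There is a great deal to sort through. All of this data is carefully pruned to protect users' privacy. It is described as a legitimate research data set, not information siphoned off indiscriminately as Cambridge Analytica did.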

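King said in an interview that the commission would be receiving much more data, with a focus on fundamental questions of social media and democracy such as disinformation, polarization, election integrity, political advertising and civic engagement.
The other data sets are at various stages of completion and approval.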

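Participants in election surveys in Mexico and elsewhere will be asked for permission to link their responses to their Facebook profiles, and the political ad archive will be formally made available. Work is also under way with the social media monitoring platform CrowdTangle, and various partnerships are being formed with other research institutions around the world. A continuous feed of all public posts on Facebook and Instagram and a large random sample of Facebook news feeds are also under consideration, subject to rigorous review and procedures.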

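Of course, good research costs money. The research grants disbursed by Social Science One are funded not by Facebook but by several foundations, a point that needs to be stated clearly, along with transparency about how the funds are used.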

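Apart from helping to select the co-chairs when the project began, Facebook does not intervene directly in the organization's overall system. That independence is important if the research is to be trusted by anyone.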

Facebook independent research commission, Social Science One, will share a petabyte of user interactions

Back in April, Facebook announced it would be working with a group of academics to establish an independent research commission to look into issues of social and political significance using the company’s own extensive data collection. That commission just came out of stealth; it’s called Social Science One, and its first project will have researchers analyzing about a petabyte’s worth of sharing data and metadata.

The way the commission works is basically that a group of academics is created and given full access to the processes and data sets that Facebook could potentially provide. They identify and help design interesting sets based on their experience as researchers themselves, then document them publicly. For instance, a set (imaginary for now) may be described as 10 million status updates taken during the week of the Brexit vote, with such and such metadata included.

This documentation describing the set doubles as a “request for proposals” from the research community. Other researchers interested in the data propose analyses or experiments, which are evaluated by the commission. These proposals will be peer-reviewed with help from the Social Science Research Council. If a proposal has merit, it may be awarded funding, data, and other benefits; resulting papers can be published however the researchers wish, with no restrictions like pre-approval by Facebook or the commission.

“The data collected by private companies has vast potential to help social scientists understand and solve society’s greatest challenges. But until now that data has typically been unavailable for academic research,” said Social Science One co-founder, Harvard’s Gary King, in a blog post announcing the initiative. “Social Science One has established an ethical structure for marshaling privacy preserving industry data for the greater social good while ensuring full academic publishing freedom.”

If you’re curious about the specifics of the partnership, it’s actually been described in a paper of its own, available here. Nate Persily is the other co-chair; he and King were selected by Facebook and the foundations funding the project (listed below), who then selected the other scholars in the group.

The first data set is a juicy one: “almost all” public URLs shared and clicked by Facebook users globally, accompanied by a host of useful metadata.

It will contain “on the order of 2 million unique URLs shared in 300 million posts, per week,” reads a document describing the set. “We estimate that the data will contain on the order of 30 billion rows, translating to an effective raw size on the order of a petabyte.”
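
As a rough sanity check on those figures, the short Python sketch below works out two derived numbers: the average number of posts per unique URL per week, and the average raw size per row. It assumes that "a petabyte" means 10^15 bytes and treats the quoted "on the order of" figures as exact, so the results are order-of-magnitude estimates only. The constant and function names are illustrative and do not come from the data set documentation.

```python
# Order-of-magnitude figures quoted in the data set description.
UNIQUE_URLS_PER_WEEK = 2_000_000
POSTS_PER_WEEK = 300_000_000
TOTAL_ROWS = 30_000_000_000

# Assumption: "a petabyte" is taken as the decimal unit, 10**15 bytes.
RAW_SIZE_BYTES = 10**15


def posts_per_url(posts: int, urls: int) -> float:
    """Average number of posts sharing each unique URL."""
    return posts / urls


def bytes_per_row(size_bytes: int, rows: int) -> float:
    """Average raw size of one row, in bytes."""
    return size_bytes / rows


if __name__ == "__main__":
    weekly_ratio = posts_per_url(POSTS_PER_WEEK, UNIQUE_URLS_PER_WEEK)
    row_size = bytes_per_row(RAW_SIZE_BYTES, TOTAL_ROWS)
    print(f"Posts per unique URL per week: {weekly_ratio:.0f}")
    print(f"Approximate raw bytes per row: {row_size:,.0f}")
```

Under those assumptions, each unique URL appears in about 150 posts per week on average, and each row accounts for roughly 33 KB of raw data.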

The metadata includes country, user age, device and so on, but also dozens of other items, such as “ideological affiliation bucket,” the proportion of friends versus non-friends who viewed a post, feed position, the number of total shares, clicks, likes, hearts, flags… there’s going to be quite a lot to sort through. Naturally all this is carefully pruned to protect user privacy — this is a proper research data set, not a Cambridge Analytica-style catch-all siphoned from the service.

In a call accompanying the announcement, King explained that the commission had much more data coming down the pipeline, with a focus on disinformation, polarization, election integrity, political advertising and civic engagement.

“It really does get at some of the fundamental questions of social media and democracy,” King said on the call.

The other sets are in various stages of completeness or permission: post-election survey participants in Mexico and elsewhere are being asked if their responses can be connected with their Facebook profiles; the political ad archive will be formally made available; they’re working on something with CrowdTangle; there are various partnerships with other researchers and institutions around the world.

A “continuous feed of all public posts on Facebook and Instagram” and “a large random sample of Facebook newsfeeds” are also under consideration, probably encountering serious scrutiny and caveats from the company.

Of course, quality research must be paid for, and it would be irresponsible not to note that the grants being disbursed by Social Science One are funded not by Facebook but by a number of foundations: the Laura and John Arnold Foundation, The Democracy Fund, The William and Flora Hewlett Foundation, The John S. and James L. Knight Foundation, The Charles Koch Foundation, Omidyar Network’s Tech and Society Solutions Lab and The Alfred P. Sloan Foundation.

To be clear (you can never be too clear when funding is involved), the foundations put their money into SSRC’s Social Data Initiative, a shared fund from which it is then distributed to cover both Social Science One’s operations and the grants. Facebook, everyone involved in this repeatedly told me, is out of the loop except for having helped pick the co-chairs at the beginning. That independence is critical, of course, if anyone is to trust the resulting research.

You can keep up with the organization’s work here; it really is a promising endeavor and will almost certainly produce some interesting science — though not for some time. We’ll keep an eye out for any research emerging from the partnership.

Update: The original headline described the dataset as “user data,” which I don’t think is inaccurate, but the organization’s suggested description of it as “URL data” is, I think, inadequate. I’ve settled for “user interactions,” since that’s more what the dataset is focused on anyway. I also made some slight changes to reflect that the SSRC reviews the proposals, not the papers, and to add the selection process for the co-chairs and other academics.

6677sky@naver.com

Reporter: Choi Jung-mi (최정미)