Do We Really Need To Catch Them All?: A New User-Guided Social Media Crawling Method
Erlandsson, Fredrik. Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science and Engineering. ORCID iD: 0000-0003-3219-9598
Bródka, Piotr. Wrocław University of Science and Technology, POL. ORCID iD: 0000-0002-6474-0089
Boldt, Martin. Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science and Engineering. ORCID iD: 0000-0002-9316-4842
Johnson, Henric. Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science and Engineering.
2017 (English). In: Entropy, E-ISSN 1099-4300, Vol. 19, no 12, article id 686. Article in journal (Refereed). Published.
Abstract [en]

With the growing use of popular social media services like Facebook and Twitter, it is hard to collect all content from the networks without access to the core infrastructure or paying for it. Thus, if all content cannot be collected, one must consider which data are of most importance. In this work we present a novel User-Guided Social Media Crawling method (USMC) that is able to collect data from social media, utilizing the wisdom of the crowd to decide the order in which user-generated content should be collected, to cover as many user interactions as possible. USMC is validated by crawling 160 Facebook public pages, containing 368 million users and 1.3 billion interactions, and it is compared with two other crawling methods. The results show that it is possible to cover approximately 75% of the interactions on a Facebook page by sampling just 20% of its posts, and at the same time reduce the crawling time by 53%. What is more, the social network constructed from the 20% sample has more than 75% of the users and edges compared to the social network created from all posts, and has a very similar degree distribution.
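
The coverage figure quoted in the abstract (the share of interactions captured by a sample of posts) can be illustrated with a small calculation. The sketch below is not the USMC method itself: it assumes the interaction count of every post is already known and simply ranks posts by that count, whereas a real crawler has to decide the order before collecting the data. The function name `coverage_of_top_fraction` and the sample counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def coverage_of_top_fraction(interactions_per_post: Sequence[int], fraction: float) -> float:
    """Return the share of all interactions held by the top `fraction` of posts.

    Posts are ranked by their (already known) interaction count, which is an
    idealised stand-in for whatever ordering a crawler would actually use.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    total = sum(interactions_per_post)
    if total == 0:
        return 0.0
    ranked = sorted(interactions_per_post, reverse=True)
    k = round(len(ranked) * fraction)  # number of posts in the sample
    return sum(ranked[:k]) / total


# Made-up, heavy-tailed interaction counts for ten posts (not data from the paper).
counts = [500, 120, 40, 30, 25, 20, 15, 10, 5, 5]
print(f"Top 20% of posts cover {coverage_of_top_fraction(counts, 0.2):.0%} of interactions")
```

With a heavy-tailed distribution like the made-up one above, a small fraction of posts accounts for most interactions, which is the kind of skew the reported result depends on.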

Place, publisher, year, edition, pages
MDPI AG, 2017. Vol. 19, no 12, article id 686
Keywords [en]
social media, social networks, sampling, crawling
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-15508
DOI: 10.3390/e19120686
ISI: 000419007900055
OAI: oai:DiVA.org:bth-15508
DiVA, id: diva2:1157267
Available from: 2017-11-15 Created: 2017-11-15 Last updated: 2023-03-28. Bibliographically approved
In thesis
1. Human Interactions on Online Social Media: Collecting and Analyzing Social Interaction Networks
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Online social media, such as Facebook, Twitter, and LinkedIn, provides users with services that enable them to interact both globally and instantly. The nature of social media interactions follows a constantly growing pattern that requires selection mechanisms to find and analyze interesting data. These interactions on social media can then be modeled into interaction networks, which enable network-based and graph-based methods to model and understand users’ behaviors on social media. These methods could also benefit the field of complex networks in terms of finding initial seeds in the information cascade model. This thesis aims to investigate how to efficiently collect user-generated content and interactions from online social media sites. A novel method for data collection, developed through exploratory research that includes prototyping, is presented as part of the research results in this thesis.

Analysis of social data requires data that covers all the interactions in a given domain, which has proven difficult to handle in previous work. An additional contribution of the research is a novel crawling method that extracts all social interactions from Facebook. Over the last few years, we have collected 280 million posts from public pages on Facebook using this crawling method. The collected posts include 35 billion likes and 5 billion comments from 700 million users. The data collection is the largest research dataset of social interactions on Facebook, enabling further and more accurate research in the area of social network analysis.

With the extracted data, it is possible to illustrate interactions between different users that do not necessarily have to be connected. Methods using the same data to identify and cluster different opinions in online communities have also been developed and evaluated. Furthermore, a proposed method is used and validated for finding appropriate seeds for information cascade analyses and for identifying influential users. Based upon the conducted research, it appears that the data mining approach of association rule learning can be used successfully to identify influential users with high accuracy. In addition, the same method can also be used for identifying seeds in an information cascade setting, with no significant difference from other network-based methods. Finally, the privacy-related consequences of posting online are an important area for users to consider. Mitigating privacy risks therefore contributes to a secure environment, and methods to protect user privacy are presented.

Place, publisher, year, edition, pages
Karlskrona: Blekinge Tekniska Högskola, 2018
Series
Blekinge Institute of Technology Doctoral Dissertation Series, ISSN 1653-2090; 1
Keywords
Social Media, Social Networks, Crawling, Complex Networks, Information Cascade, Seed Selection, Privacy
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-15503
ISBN: 978-91-7295-344-4
Public defence
2017-01-15, J1650, Karlskrona, 13:00 (English)
Available from: 2017-11-23 Created: 2017-11-15 Last updated: 2022-05-25. Bibliographically approved

Open Access in DiVA

fulltext (774 kB), 316 downloads
File information
File name: FULLTEXT02.pdf
File size: 774 kB
Checksum (SHA-512): bcbb09196de5d0da4d504457b4885232a4853eb95ef922f3c95086433c570efcb5e5a19736c9ef2cebe0d47f53edcae7030f06a8e65a151ce2e65169c8f56d54
Type: fulltext
Mimetype: application/pdf
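
The SHA-512 checksum listed in the file information can be used to confirm that a downloaded copy of the full text is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha512_of_file` and `matches_record` are hypothetical, and the local path in the usage comment assumes the file was saved under the file name shown in the record.

```python
import hashlib

# Checksum copied from the file information in this record.
EXPECTED_SHA512 = (
    "bcbb09196de5d0da4d504457b4885232a4853eb95ef922f3c95086433c570efc"
    "b5e5a19736c9ef2cebe0d47f53edcae7030f06a8e65a151ce2e65169c8f56d54"
)


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str) -> bool:
    """Check a local file against the checksum published in the record."""
    return sha512_of_file(path) == EXPECTED_SHA512


# Usage (assumes the PDF was saved locally under this name):
# print(matches_record("FULLTEXT02.pdf"))
```

Reading the file in chunks keeps memory use constant regardless of file size, which matters little for a 774 kB PDF but makes the helper reusable for larger files.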

Other links

Publisher's full text: Do We Really Need to Catch Them All? A New User-Guided Social Media Crawling Method

Authority records

Boldt, Martin; Johnson, Henric

Search in DiVA

By author/editor
Erlandsson, Fredrik; Bródka, Piotr; Boldt, Martin; Johnson, Henric

Total: 431 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 3207 hits