Leveraging Stratification in Twitter Sampling

Vikas Joshi, Deepak Padmanabhan, LV Subramaniam

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With Tweet volumes reaching 500 million a day, sampling is inevitable for any application using Twitter data. Recognizing this, data providers such as Twitter, Gnip and Boardreader license sampled data streams priced according to sample size. Big Data applications working with sampled data need a sample that is large enough to be representative of the universal dataset. Previous work on representativeness has focused on ensuring that the global occurrence rates of key terms can be reliably estimated from the sample. Existing techniques estimate the required sample size from probabilistic bounds on occurrence rates for the case of uniform random sampling. In this paper, we consider the problem of further improving sample size estimates by leveraging stratification in Twitter data. We analyze our estimates through an extensive study using simulations and real-world data, establishing the superiority of our method over uniform random sampling. Our work provides the technical know-how for data providers to expand their portfolio to include stratified sampled datasets, while applications benefit by being able to monitor more topics/events at the same data and computing cost.
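To make the contrast between uniform and stratified sampling concrete, here is a minimal sketch. It is not the paper's method: it uses the textbook normal-approximation sample size for estimating a term's occurrence rate, with proportional allocation across strata and no finite-population correction. The function names, stratum weights, and per-stratum rates are all hypothetical values chosen for illustration.

```python
import math

Z_95 = 1.96  # approximate two-sided 95% normal quantile (assumed confidence level)


def uniform_sample_size(rate, margin, z=Z_95):
    """Tweets needed to estimate an occurrence rate within +/- margin
    under uniform random sampling (normal approximation)."""
    return math.ceil(z * z * rate * (1.0 - rate) / (margin * margin))


def stratified_sample_size(weights, rates, margin, z=Z_95):
    """Total tweets needed under proportional allocation, where stratum h
    holds a fraction weights[h] of all tweets and has occurrence rate rates[h]."""
    if len(weights) != len(rates):
        raise ValueError("weights and rates must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("stratum weights must sum to 1")
    # Only the within-stratum variance contributes to the estimator's variance.
    within = sum(w * p * (1.0 - p) for w, p in zip(weights, rates))
    return math.ceil(z * z * within / (margin * margin))


# Hypothetical strata: a large stratum where the term is rare and a small
# stratum where it is common.
weights = [0.7, 0.3]
rates = [0.01, 0.20]
overall_rate = sum(w * p for w, p in zip(weights, rates))

n_uniform = uniform_sample_size(overall_rate, margin=0.01)
n_stratified = stratified_sample_size(weights, rates, margin=0.01)
print(n_uniform, n_stratified)
```

Under these assumptions the stratified requirement never exceeds the uniform one, because the within-stratum variance is at most the overall variance; the gap widens as the strata differ more in how often the term occurs.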
Language: English
Title of host publication: ECAI 2016
Publisher: IOS Press
Pages: 1212-1220
Number of pages: 9
Volume: 285
ISBN (Print): 9781614996729
DOI: https://doi.org/10.3233/978-1-61499-672-9-1212
Publication status: Published - 02 Sep 2016
Event: 22nd European Conference on Artificial Intelligence - The Hague, Netherlands
Duration: 29 Aug 2016 to 02 Sep 2016
http://www.ecai2016.org/

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Publisher: IOS
ISSN (Print): 0922-6389

Conference

Conference: 22nd European Conference on Artificial Intelligence
Abbreviated title: ECAI 2016
Country: Netherlands
City: The Hague
Period: 29/08/2016 to 02/09/2016

Fingerprint

Sampling
Costs
Big data

Cite this

Joshi, V., Padmanabhan, D., & Subramaniam, LV. (2016). Leveraging Stratification in Twitter Sampling. In ECAI 2016 (Vol. 285, pp. 1212-1220). (Frontiers in Artificial Intelligence and Applications). IOS Press. https://doi.org/10.3233/978-1-61499-672-9-1212
@inproceedings{edc0084d038546b6a6a522bb31c892af,
title = "Leveraging Stratification in Twitter Sampling",
author = "Vikas Joshi and Deepak Padmanabhan and LV Subramaniam",
year = "2016",
month = "9",
day = "2",
doi = "10.3233/978-1-61499-672-9-1212",
language = "English",
isbn = "9781614996729",
volume = "285",
series = "Frontiers in Artificial Intelligence and Applications",
publisher = "IOS Press",
pages = "1212--1220",
booktitle = "ECAI 2016",
}

