Abstract
With Tweet volumes reaching 500 million a day, sampling is inevitable for any application using Twitter data. Recognizing this, data providers such as Twitter, Gnip and Boardreader license sampled data streams priced according to sample size. Big Data applications that work with sampled data want a sample large enough to be representative of the universal dataset. Previous work on the representativeness issue has focused on ensuring that the global occurrence rates of key terms can be reliably estimated from the sample. Present technology allows sample sizes to be estimated from probabilistic bounds on occurrence rates for the case of uniform random sampling. In this paper, we consider the problem of further improving sample size estimates by leveraging stratification in Twitter data. We analyze our estimates through an extensive study using simulations and real-world data, establishing the superiority of our method over uniform random sampling. Our work provides the technical know-how for data providers to expand their portfolio to include stratified sampled datasets, while applications benefit by being able to monitor more topics/events at the same data and computing cost.
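As a rough illustration of the kind of reasoning the abstract refers to, the sketch below uses Hoeffding's inequality, a standard textbook bound, to size a uniform random sample, and then compares the variance of a rate estimate under uniform sampling with that under proportionally allocated stratified sampling. This is not taken from the paper: the choice of bound, the function names, and the two-stratum numbers are all assumptions made for illustration, and the paper's own estimates may be derived differently.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n with P(|p_hat - p| >= epsilon) <= delta for i.i.d. uniform draws.

    Assumption: Hoeffding's bound 2 * exp(-2 * n * epsilon**2) <= delta.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def uniform_variance(p: float, n: int) -> float:
    """Variance of the estimated occurrence rate under uniform random sampling."""
    return p * (1.0 - p) / n


def stratified_variance(weights, rates, n: int) -> float:
    """Variance under proportional allocation (n_h = W_h * n) across strata.

    weights: stratum shares W_h (summing to 1); rates: per-stratum rates p_h.
    """
    return sum(w * p * (1.0 - p) for w, p in zip(weights, rates)) / n


# Hypothetical example: two equally sized strata with different term rates.
weights = [0.5, 0.5]
rates = [0.1, 0.3]
overall_rate = sum(w * p for w, p in zip(weights, rates))

n = hoeffding_sample_size(epsilon=0.01, delta=0.05)
print(n, uniform_variance(overall_rate, n), stratified_variance(weights, rates, n))
```

Under these assumptions the stratified variance is never larger than the uniform one for the same n, and the gap grows as the per-stratum rates diverge. That is the intuition behind needing a smaller sample for the same guarantee.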
Original language | English |
---|---|
Title of host publication | ECAI 2016 |
Publisher | IOS Press |
Pages | 1212-1220 |
Number of pages | 9 |
Volume | 285 |
ISBN (Print) | 9781614996729 |
Publication status | Published - 02 Sep 2016 |
Event | 22nd European Conference on Artificial Intelligence, The Hague, Netherlands. Duration: 29 Aug 2016 → 02 Sep 2016. http://www.ecai2016.org/ |
Publication series
Name | Frontiers in Artificial Intelligence and Applications |
---|---|
Publisher | IOS |
ISSN (Print) | 0922-6389 |
Conference
Conference | 22nd European Conference on Artificial Intelligence |
---|---|
Abbreviated title | ECAI 2016 |
Country/Territory | Netherlands |
City | The Hague |
Period | 29/08/2016 → 02/09/2016 |