Ideas

The page is constantly being updated
Topic 1: Automated discovery of public-private corruption cases (Inforegister)
#InforegisterAPI

Corruption is a major obstacle to sustainable economic, political and social development. Overall, corruption reduces efficiency and increases inequality. CleanGovBiz estimates that the cost of corruption equals more than 5% of global GDP. Hence effective means of preventing corruption have a significant effect on development.

Corruption is also an issue in Estonia. In fact, the number of revealed corruption cases is on the rise: the number of registered corruption offences was 161 in 2012 and 450 in 2015.

This project aims at reducing public-private corruption in Estonia by providing a mechanism for automatically revealing corruption patterns and using the mechanism to discover corruption cases already in their early stages. The approach is to apply a combination of social network analysis and machine learning techniques to analyze temporal networks of organizations, persons and assets (tenders, financial aid, real estate objects, …) in order to find temporal network patterns, which describe existing corruption cases.

Inforegister will provide the following data for the purpose of the project:

  1. Board members and owners of businesses
  2. Some features of businesses
  3. Real estate ownership data

In addition the project will benefit from the following datasets:

  1. Public tender data (sums, descriptions and winners of tenders)
  2. Grants, financial aid and subsidies
  3. Public sector officials/employees
  4. Corruption cases (specific organizations and persons) for learning

This project is proposed in conjunction with the agenda of EVUL (Eesti Võlausaldajate Liit), a non-profit organization that aims to increase the transparency of doing business in Estonia. Read more here (in Estonian).

Anyone interested in participation? Contact me (peep.kungas@ir.ee).

Topic 2: Does lower credit risk imply better well-being? (Inforegister)
#InforegisterAPI

The core hypothesis of this project is that well-being correlates strongly with the business risks of companies. If this is indeed true, then well-being can be systematically improved by developing effective policies that reduce the internal risks (for instance operational risks such as credit risk) and external risks (such as systematic risk) of companies.

Moreover, whereas well-being indices (such as the Gini index) are measured annually, there are models that measure business risks daily (e.g. https://ir.ee/lahendused/krediidiskoor). If business risks and well-being indices correlate strongly, business risk evaluation models could cut the year-long delay in measuring the effect of policy changes on well-being to virtually zero. This would considerably raise the quality of decision-making with respect to well-being.

This project aims to develop a methodology for measuring the correlation between business risks and well-being. As a proof of concept, a dashboard for decision-makers will also be developed, providing longitudinal and regional visualization of business risks in Estonia.
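
One possible starting point for such a methodology, offered only as a minimal sketch: average the daily risk scores per year and compute the Pearson correlation with an annual well-being index. The function names and all numbers below are made up for illustration; they are not Inforegister data or an agreed method.

```python
from collections import defaultdict
from math import sqrt


def annual_means(daily_scores):
    """Average daily scores per calendar year.

    daily_scores: iterable of (iso_date_string, score) pairs.
    """
    buckets = defaultdict(list)
    for iso_date, score in daily_scores:
        buckets[int(iso_date[:4])].append(score)
    return {year: sum(v) / len(v) for year, v in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, not real data.
daily_risk = [("2014-03-01", 0.30), ("2014-09-01", 0.34),
              ("2015-03-01", 0.28), ("2015-09-01", 0.26),
              ("2016-03-01", 0.22), ("2016-09-01", 0.20)]
wellbeing = {2014: 61.0, 2015: 64.0, 2016: 69.0}

risk_by_year = annual_means(daily_risk)
years = sorted(set(risk_by_year) & set(wellbeing))
r = pearson([risk_by_year[y] for y in years], [wellbeing[y] for y in years])
print(f"Pearson r over {len(years)} years: {r:.2f}")
```

With only a handful of annual points the coefficient is very unstable, so a real analysis would probably need regional breakdowns or longer series.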

Inforegister will provide the following data for the purpose of the project:

  1. Some features of businesses
  2. Time-series of business credit scores of Estonian businesses

In addition the project will benefit from the following datasets:

  1. Regional unemployment statistics
  2. Well-being metrics

For more questions contact peep.kungas@ir.ee.

Topic 3: Segregation in boards of companies for now-casting labor market changes (Inforegister)
#InforegisterAPI

The core hypothesis of this topic is that the distribution of gender and age segregation reflects changes in the labor market. Some initial findings seem to support this hypothesis (e.g. the isolation index correlates negatively with the unemployment rate of young men in Lääne-Virumaa in 2008-2015), but further studies are required to confirm it. Furthermore, researchers at UNIPI have developed models for measuring segregation in boards of companies, for which input data is available on a daily basis.

Hence, using segregation metrics to now-cast (un)employment means in practice that the quarterly delays in measuring the effect of policy changes on (un)employment can be reduced to virtually zero. This would raise the quality of decision-making with respect to (un)employment.
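
To make the isolation index mentioned above concrete, here is a minimal sketch that treats each company board as one unit. It assumes the standard exposure-based definition (the sum over units of the minority share held by the unit times the minority proportion within the unit); the UNIPI models may define or weight it differently, and the board counts below are made up.

```python
def isolation_index(units):
    """Isolation index of a minority group across units (e.g. boards).

    units: list of (minority_count, total_count) pairs, one per board.
    Returns a value in (0, 1]; higher means minority members mostly
    sit on boards made up of other minority members.
    """
    total_minority = sum(m for m, _ in units)
    if total_minority == 0:
        raise ValueError("no minority members in any unit")
    return sum((m / total_minority) * (m / t) for m, t in units if t > 0)


# Made-up boards: (young members, all members).
boards = [(2, 2), (0, 3), (1, 4)]
print(f"isolation index: {isolation_index(boards):.3f}")
```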

Inforegister will provide the following data for the purpose of the project:

  1. Some features of businesses
  2. Board membership data

In addition the project will benefit from the following datasets:

  1. Regional unemployment statistics, e.g. http://pub.stat.ee/px-web.2001/I_Databas/Social_life/09Labour_market/04Employed_persons/02Annual_statistics/02Annual_statistics.asp
  2. Annual segregation metrics for 1997-2016 (calculated as of 1 January of each year), provided by Alessandro Baroni from UNIPI
Topic 4: Investments to Estonian startups (Taivo Pungas)
#EstonianStartups

There is great data about the financing of Estonian startups (starting from 2006). Just visualizing this data would be exciting to a wide audience, but it would be even more exciting to look into which startups have gone out of business. That way we could map the development process of a typical Estonian startup: when and how much financing is received, how likely it is to get to the next investment round, how long the process takes, etc.

Topic 5: Salaries of Estonian public officials (Taivo Pungas)
#PublicOfficialsSalaries

Information about the salaries of all officials working in Estonian public offices and local governments is publicly available. Based on this information one could research:

  • How have the salaries changed? Has the mean salary increased? How does it differ from the Estonian mean salary, and is it changing at the same pace?
  • What is the churn rate among public officials? In which organization is it the largest? How many workers change their job for another public office job?
  • Has the total number of workers/workplaces increased? Does it differ in different organizations?
Topic 6: Visualising the books published in Estonia (Taivo Pungas)
#books

The Estonian National Library has a dataset about all books published in Estonia (the webpage lists contacts you need to write to in order to get the full dataset). With this data, you can answer interesting questions like:

  • How much has the number of published books changed?
  • How much has the number of published books changed by topic?
  • Which authors and publishers have published the most?
  • Which words appear in the titles the most/least?
Topic 7: Analysing horoscope texts (Taivo Pungas)

Analysing the predictive power of horoscopes wouldn’t be a very reasonable project, but there is a much cooler question to ask: are the predictions the same for all zodiac signs? Information is Beautiful has answered this for horoscopes written in English, but it has not been done for Estonian horoscopes. You can find data by googling. Because of the morphology of the Estonian language, the text has to be preprocessed (for example, to count words, “põrand” and “põrandal” must be treated as the same word). This can be done, for example, with the estnltk library in Python.
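
The preprocessing and comparison step could look roughly like the sketch below. The `TOY_LEMMAS` lookup is a made-up stand-in for a real morphological analyser such as estnltk (whose API is deliberately not used here), and the two horoscope snippets are invented.

```python
import re
from collections import Counter

# Hypothetical toy lookup standing in for a real lemmatizer.
TOY_LEMMAS = {"põrandal": "põrand", "põrandat": "põrand"}


def lemmatize(word):
    """Map an inflected form to its base form (toy version)."""
    return TOY_LEMMAS.get(word, word)


def lemma_counts(text):
    """Count base forms of the words in a text, ignoring case."""
    words = re.findall(r"\w+", text.lower())
    return Counter(lemmatize(w) for w in words)


def jaccard(counts_a, counts_b):
    """Vocabulary overlap of two texts: shared lemmas / all lemmas."""
    a, b = set(counts_a), set(counts_b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Invented snippets, one per zodiac sign.
horoscopes = {
    "aries": "Põrand on puhas",
    "taurus": "Põrandal on vaip",
}
counts = {sign: lemma_counts(text) for sign, text in horoscopes.items()}
print(jaccard(counts["aries"], counts["taurus"]))
```

A high overlap between all pairs of signs would suggest the predictions are largely interchangeable.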

Topic 8: Facebook posts of Estonian political parties (Taivo Pungas)
#PoliticalPartiesFacebook

All Facebook posts of Estonian political parties (with comments and reaction numbers) are conveniently available. Using this data one can look into many exciting questions:

  • How do the words used differ between the parties? Has this changed over time?
  • To which webpages are the parties linking the most?
  • How many and which kinds of reactions do the posts get? Has it changed with the change of coalition? Are there differences between the parties?
  • What is the polarity of the text?
Topic 9: Analysing the work of Estonian parliament (Taivo Pungas)
#EstonianParliament

You can access all voting results and verbatim records on the parliament's webpage. They are also conveniently stored in data files. Questions that may be of interest:

  • Who votes similarly (do the factions vote together)? Are the parties different or does everybody vote the same way in the end?
  • If you can define a distance between two representatives that says how similarly they vote, you can create a network (graph) and use classical network analysis tools on it, for example cluster detection.
  • How uniform are the coalitions: how often are they voting the same way? Which bills are causing disagreements?
  • For every representative: how often are they participating in the hearings? Have they spoken out? How have they voted?
You can find inspiration from the page VoteWatch that follows how European Parliament members vote.
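
The distance-and-network idea from the list above can be sketched as follows. The distance used here (share of jointly cast votes on which two members disagreed), the 0.25 threshold and the connected-components clustering are all assumptions chosen for simplicity; the member names and votes are made up.

```python
from itertools import combinations


def vote_distance(votes_a, votes_b):
    """Share of jointly cast votes on which two members disagreed."""
    shared = [bill for bill in votes_a if bill in votes_b]
    if not shared:
        return 1.0  # no common votes: treat as maximally distant
    disagreements = sum(votes_a[b] != votes_b[b] for b in shared)
    return disagreements / len(shared)


def similarity_clusters(votes, threshold=0.25):
    """Connected components of the graph that links members whose
    vote distance is at most `threshold`."""
    neighbours = {name: set() for name in votes}
    for a, b in combinations(votes, 2):
        if vote_distance(votes[a], votes[b]) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in votes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Made-up voting records: member -> {bill: vote}.
votes = {
    "A": {"bill1": "for", "bill2": "against", "bill3": "for"},
    "B": {"bill1": "for", "bill2": "against", "bill3": "for"},
    "C": {"bill1": "against", "bill2": "for", "bill3": "for"},
    "D": {"bill1": "against", "bill2": "for"},
}
print(similarity_clusters(votes))
```

Connected components are the crudest form of cluster detection; a dedicated community detection method on a weighted graph would be a natural next step.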
Topic 10: Predicting the results of 2017 local elections (Taivo Pungas)
#LocalElections, #CitizenAge

Predicting the results of elections is always an exciting topic. The project has two difficult parts: a) collecting and cleaning the data and b) building a good prediction model. If this seems too intimidating, it is also OK to just visualize the data. For example, it would be great to know the budget of each local government (budget data) in relation to its number of voting-age citizens (citizen data). This would show how many euros one voter’s vote influences (assuming everyone votes). It would be even more interesting to find out in which local government a single person’s vote is most likely to change the election result.
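
The budget-per-voter figure described above is simple arithmetic; a minimal sketch, with hypothetical municipality names and made-up numbers, might be:

```python
def euros_per_voter(budgets, voters):
    """Budget divided by voting-age citizens, per municipality.

    Municipalities with no (or zero) voter count are skipped.
    """
    return {m: budgets[m] / voters[m] for m in budgets if voters.get(m)}


# Made-up figures: annual budget in euros and voting-age citizens.
budgets = {"Municipality X": 12_000_000.0, "Municipality Y": 900_000.0}
voters = {"Municipality X": 8_000, "Municipality Y": 450}

for name, value in sorted(euros_per_voter(budgets, voters).items()):
    print(f"{name}: {value:.0f} euros per voter")
```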

Topic 11: Recommendation system (STACC)
Some public datasets to try ideas: #OpenFoodFacts, #Yelp, #Dataverse

A recommendation system is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to an item [Wikipedia]. Recommendation engines are used to increase revenue and user satisfaction in many different types of business, but could they be used to improve people’s quality of life? We challenge you to create an innovative use of a recommendation system that promotes social good. Can you help people eat healthier? Or find political candidates with matching ideas? Or maybe provide people with information that helps them enjoy public city spaces? Be creative, and good hacking!

Simple RS in Python:
Some public dataset collections: #EuropeanFoodSafetyAuthority, #EUOpenData, #DataverseDB, #KaggleDatasets, #OpenStreetMaps, #UCI, #AmazonDatasets
Topic 12: Analysis of terrorism data
#TerrorismData

Terrorism has become a global issue in recent years. Using public data related to terrorism you can help society get an up-to-date picture and answer the following questions:

  • Have attacks become more deadly?
  • Does the number of terrorist attacks depend on the weapon, region, etc.?
Topic 13: Monitoring of crowdsourced processes (Estonian Cooperation Assembly)
#EstonianCooperationAssembly

In leading important societal processes there comes a moment when reaching results is up to others, either policymakers or officials. How do you follow the progress and push for the implementation of a crowdsourced process that intends to change the way our society works? The People’s Assembly on the Future of Ageing needs your help in working out a follow-up mechanism for proposals on pension system reform.

Watch example initiative here (EST).

You may choose whichever initiative you find most engaging and build whatever cool solution your team comes up with. Take a look here (EST).

Pension vision: https://pension2050.kogu.ee/visioon/#currentsituation

Rahvaalgatus.ee FE code: https://github.com/rahvaalgatus/rahvaalgatus

Uuseakus.rahvaalgatus FE is using rahvaalgatus.ee code.

Topic 14: Analysis of court cases data (TEXTA)
#TextaAPI

Predict the court case outcome based on the known information from the collection of resolved cases. Data will be available through TEXTA Toolkit.

Topic 15: Analysing water monitoring data (RMK)
#RMK

These water monitoring data consist of water column heights recorded every hour at about 40 geographical points and converted to water levels relative to the ground surface. The oldest data series start in 2012, the shortest in autumn 2015. We collect the data to assess the effectiveness of mire restoration (removal of drainage). So this is indeed a large dataset that could be interesting for students to work with.

Some thoughts on problems that could be solved:
  1. Because the monitoring devices were bought through different procurements, they come from different manufacturers, which also means that the data output differs. Our interest, however, is to put all the data into one database and to do so as automatically as possible. The problem is that the order of the data in the output file differs between manufacturers, as does, for example, the timestamp format. So we have to standardize the output files to suit our needs.
  2. Because these are long time series, checking data quality visually by browsing tables or charts is not a very productive activity. But it sometimes happens that a device registers a reading while the monitoring well is being inspected, or that the electronics produce odd fluctuations. The dataset would therefore need to be filtered for readings that differ markedly from those of the previous and the next hour, and each such exception replaced with the mean of the previous and the next hour.
  3. We have also thought that there could be a public-facing web application showing the locations of the restoration areas on a map, where clicking on an area would display its water level charts.
  4. Finally, the classic question: are summer water levels higher after the restoration works than before?
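
Items 1 and 2 of the list above (standardizing timestamps from different manufacturers, and replacing a reading that differs markedly from both the previous and the next hour with their mean) could be sketched as below. The timestamp formats, the `max_jump` threshold and the sample readings are assumptions for illustration; the real device formats are not given here.

```python
from datetime import datetime

# Assumed formats only; check them against the real output files.
# Note that day/month order is ambiguous without knowing the device.
ASSUMED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d.%m.%Y %H:%M", "%m/%d/%Y %H:%M")


def parse_timestamp(raw, formats=ASSUMED_FORMATS):
    """Try each known format in turn and return a datetime."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {raw!r}")


def replace_spikes(levels, max_jump):
    """Return a copy of hourly levels with isolated spikes smoothed.

    A reading counts as a spike when it differs from both the previous
    and the next reading by more than max_jump while those two
    neighbours agree with each other to within max_jump. Spikes are
    replaced by the mean of the two neighbours.
    """
    cleaned = list(levels)
    for i in range(1, len(levels) - 1):
        prev_val, cur, next_val = levels[i - 1], levels[i], levels[i + 1]
        if (abs(cur - prev_val) > max_jump
                and abs(cur - next_val) > max_jump
                and abs(prev_val - next_val) <= max_jump):
            cleaned[i] = (prev_val + next_val) / 2
    return cleaned


# Made-up water levels in cm relative to the ground surface.
raw_levels = [-20.0, -21.0, 35.0, -22.0, -23.0]
print(replace_spikes(raw_levels, max_jump=10.0))
print(parse_timestamp("31.12.2015 14:00"))
```

Requiring the two neighbours to agree keeps genuine step changes (for example after a dam is closed) from being smoothed away; consecutive bad readings would need a wider window.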
Topic 16: ASSET Challenge: Smarter Sustainable Shopping (ETH Zurich)
#ASSET

ASSET Challenge: Smarter Sustainable Shopping is about extracting meta-information from product data that can be used to make more sustainable purchase decisions.

Participants will be asked to perform text mining, topic modeling and text summarization over product data and compute correlation/sentiment values that can be used to rate products according to sustainability criteria.

Participants will be provided with the scripts, data and tutorials.

Discussions related to ASSET are held in the Slack channel #asset-challenge.

This challenge will offer up to three coupon prizes worth 1000, 600 and 400 euros.

Topic 17: Loxodon challenge (Loxodon)
#Loxodon

You see them on almost all highways all over the world: cameras with Automatic Number Plate Recognition (ANPR). There are multiple tools that chop the videos made by those cameras into images, then detect a number plate in an image and finally read the letters on the number plate itself. The algorithms have been tuned over the years by many experts and their quality is very high. But did you notice that every car also carries the logo of its brand? Loxodon challenges you to come up with a mechanism that detects car logos in these videos with the same precision as number plate detection!

Data provided

We will provide you with short videos of cars and a database of logos of the most common car manufacturers. Furthermore, you can use the web to gather more videos and even record videos of cars with your own mobile phone.

Scoring

During the assessment we look at the stability of the chosen solution, the percentage of correct hits on a randomly chosen test set and how professionally you present your solution.

What do we offer

If your team (max team size is 5) takes first place, Loxodon sponsors a trip to Amsterdam (The Netherlands)! There you can present your solution to the Loxodon experts and enjoy the weekend as a tourist (flight tickets and B&B).

Topic 18: Child welfare (Ministry of Social Affairs of Estonia)
#ChildWelfare

Analysis of data related to different aspects of children’s lives can provide useful information for families and social services, help adjust pediatric patient care, etc. We offer a rich collection of data gathered in Estonia and abroad. A wide range of questions can be addressed using this data, e.g. related to health, physical activity, depression, etc. One of the ideas of the project is to present the collected information about children’s problems on a website in an engaging, user-friendly way: http://denhov.com/test/ (website is work-in-progress).

Topic 19: Prediction of dog competition results
#DogsComp

The results of obedience competitions for dogs are based on various factors. The team's task is to predict the score of any dog in upcoming competitions based on information from previous competitions, i.e. dog breed, dog age, handler, judge, previous scores (in different classes) and organizing club.

Topic 20: Registry of Changemakers (Social Enterprise Network of Estonia)

Social Enterprise Network of Estonia

The vision: to change the way society accounts for value. (It is also the vision of Social Value International.)

The aim: to map and publish positive changes in human well-being and natural environment created by charities and social enterprises as well as the programs of CSR companies and public sector.

What we have: a beta version of a web solution called Maailmamuutjad.ee (in English: Changemakers.ee). The website helps to create, publish and search for reports about the objectives and results of organizations that claim to change the world for the better. Currently, the focus is on Estonia but there is high potential for scaling it up globally.

Web solution currently includes:

  • Methodology for setting societal impact objectives, defining indicators and providing quantitative as well as qualitative evidence that positive changes have indeed been achieved;
  • Methodology for structuring impact information so that information from individual organizations can be visualized on a geographical map;
  • Visualization solution using Google Maps;
  • Four sample reports of varying quality.
The challenges for SIDH2017:
  • How to develop the interface of the site so that the most important information is presented in a more attractive and easy-to-understand manner? E.g. what are the opportunities for data visualization? How to improve the design of the report (both web and PDF versions)? How to present and aggregate the key performance indicators of an organization’s social impact? Etc.
  • How to develop the website's content management system (e.g. the methodology) so that more valuable data is collected more easily and better analysis and comparison of data become possible? (The CMS is currently based on WordPress.)
The whole system is currently in Estonian. However, the representative of the Estonian Social Enterprise Network is willing to make an extra effort so that all the information necessary for the team's work is available in English during the hackathon.
Topic 21: Public data and your future home (Aleksandr Michelson)

Why? When you choose the location of your future home, you rely mostly on the data published in the real estate ad, your personal experience of the property, or real estate agencies and their agents. There should be a comprehensive tool that makes your residential choice more transparent, better informed and less time-consuming, and that lets the data serve the interests of a greater number of people, thereby contributing to social equality by eliminating information barriers. The choice should be based on objective public data.

How? There is a wide range of databases, data from monitoring activities and data from research studies: the Estonian Land Board Geoportal maps, the land register, urban planning databases, assessments of major accident risks, flood areas, noise maps, climate monitoring data, sun path, hiking trails, recreational areas, location of military objects, roads used by cargo traffic, and so on. Why not combine all such data to improve the process of choosing a residential location for all Estonian people?

It already sounds like quite a challenging project, doesn’t it? If you like to face challenges, then this is a project you are likely to want to contribute to.

During the SIDH2017 event you could find solutions to identifying:

  • The best ICT way(s) of making a well-informed residential location choice (e.g. main tools, user interface, and so on);
  • The best ways of organising and managing this innovative solution (e.g. public / private, governmental / non-governmental, and so on);
  • Sustainable ways of continuously collecting and updating the data (e.g. human / non-human data input, cooperation types with various actors, and so on);
  • The next possible steps after the SIDH2017 event.
Topic 22: Spatial visualisation of Estonian registered unemployment data (Andres Võrk)
#UnemploymentByMunicipalities, #EstonianMunicipalities, #Unemployment

Using monthly registered unemployment data one could build an application (e.g. in R Shiny) that:

  • visualises registered unemployment data on the map of Estonia
  • visualises and quantifies spatial patterns and changes (trends, cycles, seasonality); one could use statistical tools from spatial analysis (e.g. spatial correlation coefficients)
  • explains changes in spatial patterns by changes in other factors (e.g. changes in demographics, location of enterprises, etc.)
Possible problems: a few municipalities have merged over the years.
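
As one example of the spatial correlation coefficients mentioned above, here is a minimal sketch of global Moran's I with binary adjacency weights (1 for neighbouring municipalities, 0 otherwise). The region names, neighbour lists and unemployment rates are made up; a real application would derive adjacency from municipality boundaries.

```python
def morans_i(values, neighbours):
    """Global Moran's I with binary, symmetric adjacency weights.

    values: dict region -> observed value (e.g. unemployment rate).
    neighbours: dict region -> set of adjacent regions.
    Values near +1 mean similar regions cluster together, values
    near -1 mean neighbours tend to differ.
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {region: v - mean for region, v in values.items()}
    # Each adjacent pair is counted once per direction.
    weight_sum = sum(len(adj) for adj in neighbours.values())
    cross = sum(dev[r] * dev[s] for r, adj in neighbours.items() for s in adj)
    variance_sum = sum(d * d for d in dev.values())
    return (n / weight_sum) * (cross / variance_sum)


# Four made-up regions in a row: a - b - c - d.
neighbours = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
rates = {"a": 1.0, "b": 1.0, "c": 5.0, "d": 5.0}
print(f"Moran's I: {morans_i(rates, neighbours):.3f}")
```

Computing the statistic month by month would give a simple time series of how spatially clustered registered unemployment is.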
Topic 23: Visualisation of Estonian registered unemployment data by occupation over time (Andres Võrk)
#UnemploymentByOccupation, #EstonianCounties

Using Estonian registered unemployment data by occupation one could:

  • visualize registered data by occupation on the map of Estonia
  • visualize and quantify changes in occupations of unemployed people
  • explain changes in occupations of unemployed people by changes in other factors (e.g. changes in demographics, economic sectors, etc.)
Topic 24: Visualisation and monetization of alcohol benefits and costs for Estonia (Andres Võrk)
#AlcoholConsumption

In Estonia, alcohol production and consumption create both benefits (employment, tourists, tax revenues, etc.) and costs (health care, violence, accidents, etc.). Currently, these costs and benefits are not visualized in an easy and comparative way. The task would be to collect the existing statistical data, add the necessary assumptions from the literature or let the user choose from a reasonable range of values, and visualize the resulting benefits and costs either on a webpage or as an R Shiny application. The purpose is to make this data easily accessible to stakeholders (journalists, politicians and lobby groups). It must be accompanied by additional assumptions on the monetary value of alcohol-related harm (e.g. value of human life, costs of accidents, etc.). There are many related studies (some only in Estonian), which must be combined.

Topic 25: Visualisation and analysis of the Estonian Youth Work Centre's 200 action plans (Kati Nõlvak)

Since this year the Estonian government has been giving money to local municipalities to improve access to hobby education and activities and to offer young people more diverse opportunities to participate. To get that financing, municipalities had to present action plans describing the main obstacles, the actions to address them and a budget. The Estonian Youth Work Centre now has more than 200 action plans, and here comes the problem/idea: how to put those action plans to better use? How to analyze and visualize the data given in those action plans in order to encourage peer-to-peer learning between local municipalities and to raise local communities' awareness of the new possibilities? More information (in Estonian): www.entk.ee

Topic 26: Knowledge diversity challenge: Is Wikipedia biased? What can we infer from data?

Wikipedia is criticized for having a bias towards the political left [1]; at the same time the online encyclopedia itself struggles with a gender gap [2], and its content faces the challenge of making use of the cultural bias of different language editions [3].

For the Social Impact Data Hack we propose a challenge: dig into the diversity of Wikipedia content with Wikidata queries [4]. Example ideas to elaborate on:

  • Differences in coverage of political issues in Estonian, Russian and German Wikipedia.
  • Gender bias in topics of Estonian Wikipedia (compared to some other Wikipedia editions).
  • Cultural bias in knowledge structures of Estonian Wikipedia.

Example queries for gender bias: 1) distribution of occupations in biographical articles http://tinyurl.com/y973aawx; 2) the same for articles about women http://tinyurl.com/ya6z7zrh

Estonian Wikipedia has not been analyzed from a diversity point of view; however, among Wikipedians there is a common belief that Estonian Wikipedia is among the less gender-biased Wikipedia editions in the world. The cultural and political bias of Estonian Wikipedia has not been studied either; however, the general claim of a bias to the left is usually acknowledged.

From a methodological point of view, the data in Wikipedia is not statistically balanced; however, it can be used to observe general tendencies, construct reasonable hypotheses or create visual insights into the problems of our intuitive knowledge structures.

  [1] https://www.wired.com/story/welcome-to-the-wikipedia-of-the-alt-right/
  [2] http://mashable.com/2016/12/08/wikipedia-gender-gap/
  [3] https://meta.wikimedia.org/wiki/Grants:Project/Wikipedia_Cultural_Diversity_Observatory_(WCDO)
  [4] https://ee.wikimedia.org/wiki/Wikidata_workshops_with_Asaf_Bartov_in_2017_(Tallinn,_Tartu)#Example_queries