Discussion: Mining Software Repositories
These notes roughly capture the (mostly unedited) comments from the class:
General discussion
When is a study “exploratory” vs. “definitive”?
- Even if you say “exploratory,” people still jump to conclusions
- Depends on the context, maybe
- Transparency in data/results/study can help future work to confirm/refute/clarify
Discussion: “The Promises and Perils of Mining GitHub”
What is wrong with using GitHub for personal projects, wrt perils?
- Heavily dependent on what you are trying to study - “well-engineered” vs… personal
- Not “wrong” to use GH for personal projects, but maybe bad to assume that they are well-engineered. In fact: having a dataset of these “less engineered” projects could be useful for some applications
- “Contaminates the data” - personal projects get mixed in across categories
- The purpose of GitHub is not to provide a database for mining :)
- There are a variety of ways to “sanitize” the data: number of committers, commit frequency, recency of activity, presence of forks/pull requests/issues
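A minimal sketch of what those sanitizing filters might look like, assuming a pre-collected CSV of repository metadata (the file name and the columns committers, last_commit, pull_requests, and issues are hypothetical):

```python
import pandas as pd

# Hypothetical repo-metadata dump; column names are assumptions.
repos = pd.read_csv("repos.csv", parse_dates=["last_commit"])

cutoff = pd.Timestamp.now() - pd.DateOffset(months=6)
engineered = repos[
    (repos["committers"] >= 2)             # more than a single author
    & (repos["last_commit"] >= cutoff)     # recent activity
    & ((repos["pull_requests"] > 0) | (repos["issues"] > 0))  # some process
]
print(f"{len(engineered)}/{len(repos)} repos pass the filters")
```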
Perils of linking issues with code:
- Biased samples based on population
- Bugs in bug trackers aren’t all bugs
- Commercial projects != open source projects
- Large sample size “might” help
GHTorrent - an interesting resource
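GHTorrent distributes relational dumps that can be queried locally. Below is a sketch of counting commits per non-fork project; the table/column names (projects.forked_from, commits.project_id, etc.) are recalled from the GHTorrent schema and should be checked against the actual dump:

```python
import pymysql

# Connection parameters are placeholders for a local GHTorrent MySQL dump.
conn = pymysql.connect(host="localhost", user="ghtorrent",
                       password="secret", database="ghtorrent")
with conn.cursor() as cur:
    cur.execute("""
        SELECT p.name, COUNT(c.id) AS n_commits
        FROM projects p
        JOIN commits c ON c.project_id = p.id
        WHERE p.forked_from IS NULL AND p.deleted = 0
        GROUP BY p.id, p.name
        ORDER BY n_commits DESC
        LIMIT 20
    """)
    for name, n_commits in cur.fetchall():
        print(name, n_commits)
conn.close()
```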
Study design
- 1,000 GH users were sent an open-ended survey; 240 responded
- relatively small, biased to those who had published their emails
- would a bigger survey result in different conclusions?
- would responses reach saturation?
- Qualitative analysis of metadata
- Manual analysis of 434 projects
The perils
- Repository != project
- Forks
- What about those 11k commits in forks not merged into upstream?
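One way to quantify fork-only work, sketched below with placeholder repo/branch names, is GitHub’s compare API, which reports how far a fork is ahead of upstream:

```python
import requests

UPSTREAM = "upstream-owner/project"   # placeholder
FORK_OWNER = "fork-owner"             # placeholder

# Compare upstream's main branch against the fork's main branch.
r = requests.get(
    f"https://api.github.com/repos/{UPSTREAM}/compare/main...{FORK_OWNER}:main",
    headers={"Accept": "application/vnd.github+json"},
)
r.raise_for_status()
print(f"{r.json()['ahead_by']} fork commits are not in upstream")
```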
- Activity is measured in commits
- Most projects have very few commits
- Most projects inactive
- What counts as “active”? A commit in the past ~6 months? (see the sketch after this list)
- Should we measure activity in terms of users? (How to measure usage, how to correct for popularity)
- Note that “inactive” != “dead” - maybe important to distinguish “used” and “active” and “alive”
- The most active 2.5% of projects account for most of the activity; the remaining 97.5% contribute little
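A sketch of the “commit in the past ~6 months” reading of active; the input format (a list of timezone-aware commit dates per repo) is an assumption:

```python
from datetime import datetime, timedelta, timezone

def is_active(commit_dates, window_days=182):
    """True iff the repo has at least one commit in the last ~6 months."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    return any(d >= cutoff for d in commit_dates)

# A repo whose last commit was in early 2020 counts as inactive today.
print(is_active([datetime(2020, 1, 1, tzinfo=timezone.utc)]))  # -> False
```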
- Large portion of repos not for software development
- used just for storage: configuration files, resumes, datasets, etc.
- not “well engineered projects” but something else
- Static websites
- “Assets” and other kinds of blobs, etc.
- Books (“books as code”)
- 2/3 of repos are personal
- 90/240 respondents said they use GH mostly for personal projects, not with the intention to communicate
- Maybe connected to the response rate/bias here - who posts email/responds/has social projects?
- Followed up with a quantitative analysis of how many committers there are
- There are personal projects that have multiple fly-by committers
- There are significant community projects that primarily have a single committer
- PRs are valuable, but not used everywhere
- PRs can be reworked and lose discussion
- Many PRs appear non-merged even though they were actually merged (see the heuristic sketch after this list)
- Could this be a UI problem, now fixed?
- Can lead to inaccurate data for studies that look at what drives “success” in merges
- Bot PRs
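A cheap heuristic for the merged-but-looks-non-merged peril, assuming rebase/squash merges often leave the PR number in a default-branch commit message. It only scans the 100 most recent commits and will miss plenty:

```python
import requests

def commit_mentions_pr(repo, pr_number, token=None):
    """repo is 'owner/name'; True if a recent default-branch commit
    message references '#<pr_number>'."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    r = requests.get(f"https://api.github.com/repos/{repo}/commits",
                     params={"per_page": 100}, headers=headers)
    r.raise_for_status()
    needle = f"#{pr_number}"
    return any(needle in c["commit"]["message"] for c in r.json())
```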
- Many active projects don’t use GitHub exclusively
- Mirrors
- Hard to expect everything in one place
The promises
- PRs are valuable source of data
- Linking developers, PRs, issues, and commits
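A sketch of that linking: GitHub-style keywords like “fixes #123” in commit messages tie commits to issues/PRs. The regex is illustrative, not exhaustive:

```python
import re

# Matches optional close/fix/resolve keywords followed by an issue number.
REF = re.compile(r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)?\s*#(\d+)",
                 re.IGNORECASE)

def linked_issues(commit_message):
    return [int(n) for n in REF.findall(commit_message)]

print(linked_issues("Fixes #42; see also #7"))  # -> [42, 7]
```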
Things that are absent:
- Mining the CODE itself
- Repository templates - where does this fit?
- Forks that you cannot tell are forks
Discussion: “A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts.”
- Motivation/intro
- JetBrains Research
- What are interesting questions to ask of large-scale notebook datasets?
- Is code in notebooks different? (Procedural vs exploratory)
- Is the code written for different reasons?
- How many users have both notebooks and scripts? (we can link committers!)
- How often are notebooks used collaboratively vs individually?
- What are the most popular languages that are used for notebooks?
- Are notebooks “actively developed” in the same way as non-notebooks?
- Who is using these tools, and why? (Students, domain scientists,…)
- Is the style of the code different depending on the context?
- Why migrate to and away from notebooks?
- Examine which libraries are used in the notebooks, cluster usage, popularity, etc
- Methodology
- Metrics
- Structural metrics (an ast-based sketch follows the metrics list)
- Number of built-in functions, user-defined functions (definitions? usages?), API functions
- Cyclomatic complexity - a proxy measure for understandability (maybe some confounding factors)
- Number of imported functions
- Cell coupling
- How many variables are referenced between cells
- Extended comment LOC
- Comments + markdown lines
- Stylistic metrics
- Linters (a minimal invocation sketch also follows the metrics list)
- Only ran linters on 100k files
- How slow was this to run? Why was that the case?
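A rough sketch of the structural metrics using Python’s ast module; the cyclomatic number is a simplified proxy (1 + branch points), not necessarily the paper’s exact definition:

```python
import ast

# Node types treated as branch points for the complexity proxy.
BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def structural_metrics(source):
    nodes = list(ast.walk(ast.parse(source)))
    return {
        "user_defined_functions": sum(
            isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes),
        "imports": sum(isinstance(n, (ast.Import, ast.ImportFrom)) for n in nodes),
        "cyclomatic_proxy": 1 + sum(isinstance(n, BRANCHES) for n in nodes),
    }

print(structural_metrics("def f(x):\n    return x if x > 0 else -x\n"))
# -> {'user_defined_functions': 1, 'imports': 0, 'cyclomatic_proxy': 2}
```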
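And a minimal sketch of the stylistic side: invoking pycodestyle programmatically over plain .py files (notebooks would first need conversion, e.g. jupyter nbconvert --to script). Per-file parsing and checking is one plausible reason linting was slow at 100k-file scale:

```python
import pycodestyle

# Placeholder paths; a real run would iterate over the converted corpus.
style = pycodestyle.StyleGuide(quiet=True)
report = style.check_files(["example1.py", "example2.py"])
print(f"{report.total_errors} style violations found")
```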
- Datasets
- All Jupyter notebooks on GH as of Nov 2020 (9.7M), excluding those in forked repos
- Collection took 1 month
- 1.7m projects
- Only Python notebooks: 8.3M
- Filtered to only those with a “permissive” license: 847,000 (see the license-check sketch after this list)
- “Copyleft” - if you change the code and distribute it (even as a compiled binary), you must make your modified source available
- “Permissive” - do more or less whatever you want (typically just preserve the attribution)
- Sampling?
- They included all that met these criteria?
- Compare notebooks to non-notebooks:
- 10k most-starred projects on GH with permissive licenses
- Sampling?
- Treated every single file separately, didn’t weight per-repo
- What KINDS of projects are these?
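A hedged sketch of the license filter, using the license.spdx_id field on the GitHub REST API repo object; the permissive set below is illustrative, not necessarily the paper’s:

```python
import requests

# Illustrative set of permissive SPDX identifiers (an assumption).
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "Unlicense"}

def is_permissive(full_name, token=None):
    """full_name is 'owner/name'; True if GitHub detects a permissive license."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    r = requests.get(f"https://api.github.com/repos/{full_name}", headers=headers)
    r.raise_for_status()
    lic = (r.json().get("license") or {}).get("spdx_id")
    return lic in PERMISSIVE
```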
- Results
- The “has no effect” linting rule is not helpful; the violation occurs widely
- Repeated “uses the same variable name” warnings ARE something where a helpful intervention might exist
- Jupyter notebooks “are more error-prone”
- Were the non-notebook files linted normally?
- Also has to do with the use cases
- Function usage and complexity?
- Far fewer function definitions in notebooks than in scripts
- More usage of functions/APIs in notebooks than in scripts
- Implications?
- Different programming models wrt structure
- Notebooks were longer
- Notebooks might “drag” on and on [let’s look at evolution!]
- Diff’ing notebooks :’( (see the sketch at the end of these notes)
- Are there raw numbers?
- Invalid statistical tests
- Metrics
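On the diffing pain above: .ipynb files are JSON with outputs and metadata interleaved, so plain line diffs are noisy. A minimal workaround is to diff only the code-cell sources (purpose-built tools like nbdime do this properly); the paths are placeholders:

```python
import difflib
import nbformat

def code_lines(path):
    """Extract just the code-cell source lines from a notebook."""
    nb = nbformat.read(path, as_version=4)
    return [line for cell in nb.cells if cell.cell_type == "code"
            for line in cell.source.splitlines()]

old, new = code_lines("old.ipynb"), code_lines("new.ipynb")  # placeholders
print("\n".join(difflib.unified_diff(old, new, lineterm="")))
```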