GitHub code search helps developers find critical information across codebases. While developers consider the search tool a step in the right direction, they also see areas for improvement around search results and support for semantic queries.
GitHub built its code search tool, released as a public beta in November 2022, as an alternative to traditional search engines because the information developers need is often buried within code and not locatable on the internet, said Colin Merkel, code search senior engineer at GitHub, in a presentation at GitHub Universe 2022.
Traditional search engines don't always meet developers' needs because the results often contain a short example or a long article instead of the actual code containing the query, said Ryhor Supruniuk, iOS tech lead at Orangesoft, a mobile app development company headquartered in San Francisco. GitHub code search enables developers to take advantage of mature, well-documented open source code and the collective knowledge of other developers, he said.
"GitHub code search gives you a detailed view of the case so that you can see how the technology works in action and what challenges await you in the development," Supruniuk said.
Once a developer clicks the search bar, the interface offers suggestions for constructing queries. For example, developers who search their own code with the query "owner:jsmith license" get suggestions for files and symbols containing the text "license" across all the repositories they own.
As a result of developer feedback from the initial December 2021 preview, GitHub added three capabilities in November 2022: a new search interface, a new search engine and a redesigned code view. The right symbols pane gives contextual and symbolic information about the search query; the redesigned code view includes the file tree on the left panel and helps developers find related files.
"I love it, and in particular, the results view with facets on the left is super useful for me," said Chris Riley, senior manager of developer relations at marketing tech firm HubSpot in Cambridge, Mass.
GitHub code search pain points
While a regular expression such as "/git.*commit/" can quickly produce results after searching billions of lines of code, some developers say the search results can get cluttered. For example, posters on one Hacker News thread noted that duplicates, such as forks or seldom-used version control directories and files, sometimes dominate search results.
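To make the regular expression above concrete, here is a minimal sketch of what the pattern matches. It uses Python's re module as a stand-in, which is an assumption: GitHub's engine may differ in syntax details, and the sample text and the matching_lines helper are hypothetical.

```python
import re

# The pattern from the article, without the surrounding slashes that
# mark a regular expression in a GitHub code search query.
PATTERN = re.compile(r"git.*commit")

# Hypothetical sample text; not taken from any real repository.
SAMPLE = """\
git commit -m "fix typo"
run git add, then commit
commit first, then push with git
git status
"""


def matching_lines(text: str) -> list[str]:
    """Return the lines that contain 'git' followed later by 'commit'."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


if __name__ == "__main__":
    for line in matching_lines(SAMPLE):
        print(line)
```

Only the first two sample lines match, because the pattern requires "git" to appear before "commit" on the same line.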
While Git is the dominant version control system on GitHub, Subversion is an old-school version control system that handles less than 0.02% of requests, according to GitHub, which plans to sunset support for Subversion in January 2024. Thus, inclusion of Subversion files in search results makes no sense in GitHub, Riley said.
"It makes me wonder what weird artifacts could come over that introduce build issues, [because] 'it worked on my machine' or 'I guess it worked on their machine,'" Riley said. "Versioning can be a serious issue ... and this could introduce all sorts of problems."
GitHub is working on improving results ranking so that users will see fewer low-quality results, said Tim Clem, GitHub staff software engineer. The team also plans to improve filtering options so that users can exclude forks, for example, and might cluster similar results to improve the quality of results.
GitHub code search supports Boolean expressions such as OR, AND and NOT. For example, to search for all code that is written in Markdown or sits in a file whose path ends with "txt," a developer can type in "owner:jsmith (lang:markdown OR path:*.txt)". But while the Boolean search is very effective, the challenge is knowing what to search for, Riley said. That pain point could be alleviated with more intuitive search capabilities.
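For developers who assemble such queries in scripts, the Boolean example above can be sketched as follows. The helper names (any_of, build_query, search_url) are hypothetical, and the URL layout is an assumption based on how the github.com search page accepts a q parameter; this is not an official client.

```python
from urllib.parse import quote_plus


def any_of(*terms: str) -> str:
    """Join terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(owner: str, *clauses: str) -> str:
    """Prefix the clauses with an owner: qualifier, separated by spaces."""
    return " ".join([f"owner:{owner}", *clauses])


def search_url(query: str) -> str:
    """Percent-encode the query into a search page URL (assumed layout)."""
    return "https://github.com/search?q=" + quote_plus(query) + "&type=code"


# Rebuild the query from the article: Markdown files or paths ending in .txt.
query = build_query("jsmith", any_of("lang:markdown", "path:*.txt"))

if __name__ == "__main__":
    print(query)
    print(search_url(query))
```

Keeping the OR group in its own helper makes the parentheses hard to forget, which matters because the owner qualifier should apply to both alternatives.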
"If they started to offer more natural language searches so you could do conceptual searches like 'a function that finds the shortest path,' that might be a neat way to build on top of Copilot to assist a developer to build better applications faster," Riley said.
GitHub is keeping mum on the matter of when and if GitHub code search will improve its ability to parse natural language queries.
"We've heard a lot of interest in semantic search, but don't have any details to share now," Clem said.
Still, semantic search might be inevitable.
"GitHub has no choice but to integrate natural language processing into its search engine," said Charlotte Dunlap, research director at GlobalData PLC, a British market research firm.
Up until now, developers have had to spend hours poring over Google looking for relevant code, Dunlap said. The more that generative AI such as ChatGPT can automate that process, the better for the army of enterprise developers trying to keep up with enterprise application modernization needs, she said.
Although GitHub now has the intellectual capital to adopt OpenAI and differentiate itself from competitors, what this meld of technologies will mean for software development is unclear, said Larry Carvalho, principal consultant at RobustCloud.
"The real value of OpenAI in GitHub will only be measurable after usage by developers in real-life application development scenarios," Carvalho said.
Developers can now sign up for the beta waitlist at GitHub.com.