Submission information
Submission Number: 110
Submission ID: 207
Submission UUID: 355a10da-2019-44cb-a2bc-ff935c8657b8
Submission URI: /form/project
Created: Fri, 09/17/2021 - 10:16
Completed: Fri, 09/17/2021 - 10:17
Changed: Tue, 08/30/2022 - 15:22
Remote IP address: 73.89.101.1
Submitted by: Katherine Nelson
Language: English
Is draft: No
Webform: Project
Project Title: US Tax Code to Natural Language Parsable Data for Programming Languages
Program: CAREERS (323)
Project Image: https://support.access-ci.org/system/files/webform/project/207/nlp_tax.png
Tags: natural-language-processing (274), python (69), sql (424)
Status: Complete

Project Leader
--------------
Project Leader: Phillip Bradford
Email: phillip.bradford@uconn.edu
Mobile Phone: {Empty}
Work Phone: {Empty}

Project Personnel
-----------------
Mentor(s): Henry Orphys (1433), Thomas Langford (510)
Student-facilitator(s): Krutika Patel (1455)
Mentee(s): {Empty}

Project Information
-------------------
Project Description: This project is to prepare a subsection of the US Tax code for natural language translation into either Catala (https://catala-lang.org/) or ErgoAI from Coherent Knowledge (http://coherentknowledge.com). We hope to get initial basic translations from the US Tax code into either Catala-lang or ErgoAI within some threshold of acceptability. Once the basic translations are within that threshold, we want to see if a local startup can complete the translations by engaging humans with expertise in the US Tax code. This very early stage startup (Neutral Tax Networks, Greenwich CT) is in the formative stage and has a patent in this area while developing other intellectual property.
Link to the House of Representatives site where the US Code is located.
The Internal Revenue Code is Title 26 (partway down this list): https://uscode.house.gov/browse/prelim@title26/subtitleA/chapter1/subchapterA/part1&edition=prelim
Here is the IRS website page that contains the links to the Internal Revenue Code and regulations, which are provided as a public service by Cornell Law School's Legal Information Institute: https://www.irs.gov/privacy-disclosure/tax-code-regulations-and-official-guidance#irc
Link to the Internal Revenue Code sections provided by Cornell's Legal Information Institute (accessible by clicking on one of the links on the IRS website): https://www.law.cornell.edu/uscode/text

Project Information Subsection
------------------------------
Project Deliverables:
- All legal tax code XML from https://www.irs.gov/…, https://uscode.house.gov, or https://www.law.cornell.edu/uscode/text transformed into English tax code.
- A validated algorithmic mapping from the English legal tax code to a format for storing in a relational database.
- All the legal tax code stored in a relational database (MySQL), using the mapping, in a form suitable for translation to ErgoAI or Catala-lang.
Project Deliverables: {Empty}
Student Research Computing Facilitator Profile: They should be able to learn Python or already know how to code in Python or a similar language. They should be able to learn to parse XML with Python.
They should also be able to learn to work with one of several Python NLP libraries, such as NLTK (https://realpython.com/nltk-nlp-python/).
Mentee Research Computing Profile: {Empty}
Student Facilitator Programming Skill Level: Some hands-on experience
Mentee Programming Skill Level: {Empty}
Project Institution: University of Connecticut - Stamford
Project Address: Stamford, Connecticut
Anchor Institution: CR-Yale
Preferred Start Date: {Empty}
Start as soon as possible.: Yes
Project Urgency: Already behind, Start date is flexible
Expected Project Duration (in months): 6
Launch Presentation: {Empty}
Launch Presentation Date: 12/08/2021
Wrap Presentation: {Empty}
Wrap Presentation Date: 06/08/2022
Project Milestones:
- Milestone Title: Capture tax code
  Milestone Description: Pull all legal tax code XML from https://www.irs.gov/…, https://uscode.house.gov, or https://www.law.cornell.edu/uscode/text
- Milestone Title: Transform tax code to English
  Milestone Description: Transform all of the IRS tax code XML into English legal tax code text.
- Milestone Title: Design organizational mapping
  Milestone Description: Validate a useful organizational mapping so that the English legal tax code text can be stored in a relational database (MySQL). This will likely require NLP processing of the tax code to make it suitable for ErgoAI or Catala-lang. This is the first part of the threshold of acceptability.
- Milestone Title: Apply organizational mapping
  Milestone Description: Apply the organizational mapping to all of the tax code.
- Milestone Title: Store mapped tax code
  Milestone Description: Store the mapped legal tax code in a relational database (MySQL).
- Milestone Title: Leveraging mapped tax code for deduction
  Milestone Description: Validate that the mapped legal tax code can be expressed as basic terms in ErgoAI or Catala-lang. This is the final part of the threshold of acceptability for the tax code translation.
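Illustrative sketch (not part of the original submission): the capture, transform, and store milestones could look roughly like the following in Python. This is a minimal sketch under stated assumptions. The element names (`section`, `subsection`, `num`, `heading`, `content`), the `id` attributes, the `provision` table, and the sample text are all hypothetical placeholders, not the actual XML schema or tax code text published at the sites listed above. SQLite from the Python standard library stands in for MySQL so the example runs without a database server.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample input; the real tax code XML schema may differ.
SAMPLE_XML = """
<title num="26">
  <section id="s1">
    <num>1</num>
    <heading>Placeholder section heading</heading>
    <subsection id="s1-a">
      <num>(a)</num>
      <heading>Placeholder heading A</heading>
      <content>Placeholder text for subsection (a).</content>
    </subsection>
    <subsection id="s1-b">
      <num>(b)</num>
      <heading>Placeholder heading B</heading>
      <content>Placeholder text for subsection (b).</content>
    </subsection>
  </section>
</title>
"""

# Tags treated as structural levels (an assumption of this sketch).
LEVEL_TAGS = ("section", "subsection")


def flatten(elem, parent_id=None):
    """Yield one row per structural element, depth-first, parents first."""
    for child in elem:
        if child.tag in LEVEL_TAGS:
            row_id = child.get("id")
            yield (
                row_id,
                parent_id,
                child.tag,
                child.findtext("num", default=""),
                child.findtext("heading", default=""),
                child.findtext("content", default="").strip(),
            )
            # Recurse so nested levels record this element as their parent.
            yield from flatten(child, row_id)


def load(xml_text, conn):
    """Parse the XML text and store the flattened rows; return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS provision (
               id TEXT PRIMARY KEY,
               parent_id TEXT REFERENCES provision(id),
               level TEXT NOT NULL,
               num TEXT,
               heading TEXT,
               body TEXT)"""
    )
    root = ET.fromstring(xml_text.strip())
    rows = list(flatten(root))
    conn.executemany("INSERT INTO provision VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load(SAMPLE_XML, connection)
    for row in connection.execute(
        "SELECT id, parent_id, num, body FROM provision ORDER BY id"
    ):
        print(row)
```

Storing a `parent_id` alongside each row keeps the hierarchy of the legal text recoverable with plain SQL, which is one possible starting point for the organizational mapping described in the milestones; the project's actual mapping is not specified in this record.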
Github Contributions: {Empty}
Planned Portal Contributions (if any): {Empty}
Planned Publications (if any): {Empty}
What will the student learn?: The student will learn how to parse, transform (using XSLT), and store the tax code data. The student will learn to parse the English tax code using a Python NLP library such as NLTK. The student will learn to organize the legal text for storage so that retrieval and mapping to either ErgoAI or Catala-lang are easy. The student will learn some data architecture. This transformed and organized tax code will be stored in a relational database such as MySQL. The student will learn SQL and how to interact with a relational database through a database workbench. The student will learn how to work with a relational database from a language like Python. If there is time, the student will learn about deduction in ErgoAI or Catala-lang.
What will the mentee learn?: {Empty}
What will the Cyberteam program learn from this project?: {Empty}
HPC resources needed to complete this project?: No clear need for HPC, though the tax code is substantial, so the NLP application may require a good deal of CPU cycles.
Notes: {Empty}

Final Report
------------
What is the impact on the development of the principal discipline(s) of the project?: This project had a solid impact on understanding automated knowledge authoring for legal reasoning. We explored a number of ways to simply transform legal text into logical reasoning in ErgoAI (a variation of Prolog).
What is the impact on other disciplines?: {Empty}
Is there an impact on physical resources that form infrastructure?: {Empty}
Is there an impact on the development of human resources for research computing?: Yes - a positive impact both on our student, Krutika Patel, and on managing student research.
Is there an impact on institutional resources that form infrastructure?: {Empty}
Is there an impact on information resources that form infrastructure?: {Empty}
Is there an impact on technology transfer?: {Empty}
Is there an impact on society beyond science and technology?: Yes - there is an impact towards technology transfer. Besides leadership by Phil Bradford, this project was done with a Connecticut entrepreneur (Henry Orphys) as well as a faculty member (Paul Fodor) from Stony Brook University. Henry has a distinguished law and tax accounting background, and he is focused on launching a startup using the technology we explored. Paul is both a faculty member and an entrepreneur. We isolated several challenges and now better understand the resources necessary for launching a product in this space.
Lessons Learned: {Empty}
Overall results: {Empty}