Submission Number: 110
Submission ID: 207
Submission UUID: 355a10da-2019-44cb-a2bc-ff935c8657b8
Submission URI: /form/project

Created: Fri, 09/17/2021 - 10:16
Completed: Fri, 09/17/2021 - 10:17
Changed: Tue, 08/30/2022 - 15:22

Remote IP address: 73.89.101.1
Submitted by: Katherine Nelson
Language: English

Is draft: No
Webform: Project
Project Title US Tax Code to Natural Language Parsable Data for Programming Languages
Program CAREERS
Project Image nlp_tax.png
Tags natural-language-processing (274), python (69), sql (424)
Status Complete
Project Leader Phillip Bradford
Email phillip.bradford@uconn.edu
Mobile Phone
Work Phone
Mentor(s) Henry Orphys, Thomas Langford
Student-facilitator(s) Krutika Patel
Mentee(s)
Project Description This project prepares a subsection of the US tax code for natural-language translation into either Catala (https://catala-lang.org/) or ErgoAI from Coherent Knowledge (http://coherentknowledge.com).

We hope to produce initial, basic translations of the US tax code into either Catala-lang or ErgoAI that meet a threshold of acceptability. Once the basic translations meet that threshold, we want to see whether a local startup can complete the translations by engaging people with expertise in the US tax code.

This very early-stage startup (Neutral Tax Networks, Greenwich, CT) holds a patent in this area and is developing other intellectual property.

Link to the House of Representatives site where the US Code is located. The Internal Revenue Code is Title 26 (partway down the list): https://uscode.house.gov/browse/prelim@title26/subtitleA/chapter1/subchapterA/part1&edition=prelim

Here is the IRS web page that links to the Internal Revenue Code and regulations, which are provided as a public service by Cornell Law School's Legal Information Institute: https://www.irs.gov/privacy-disclosure/tax-code-regulations-and-official-guidance#irc

Link to the Internal Revenue Code sections provided by Cornell's Legal Information Institute (accessible by clicking one of the links on the IRS website): https://www.law.cornell.edu/uscode/text
Project Deliverables - All legal tax code XML from https://www.irs.gov/…, https://uscode.house.gov or https://www.law.cornell.edu/uscode/text transformed into English tax code.
- A validated algorithmic mapping from the English legal tax code to a format for storing in a relational database.
- All the legal tax code stored in a relational database (MySQL), using a mapping that is suitable for translation to ErgoAI or Catala-lang.
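The first deliverable, turning tax code XML into plain English text, might be sketched as follows. This is only an illustration using Python's standard xml.etree.ElementTree module: the sample XML, its element names (section, subsection, num, heading, content) and the placeholder wording are assumptions made for the example and are not taken from the actual IRS, House, or Cornell files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names and text are placeholders for illustration,
# not the real schema or wording of the published tax code XML.
SAMPLE_XML = """
<section id="s1">
  <num>1</num>
  <heading>Example section heading</heading>
  <subsection id="s1-a">
    <num>(a)</num>
    <heading>First example heading</heading>
    <content>Placeholder text for subsection <i>(a)</i>.</content>
  </subsection>
  <subsection id="s1-b">
    <num>(b)</num>
    <heading>Second example heading</heading>
    <content>Placeholder text for subsection (b).</content>
  </subsection>
</section>
""".strip()


def child_text(element, tag):
    """Return the whitespace-normalized text of a direct child, or ''."""
    child = element.find(tag)
    if child is None:
        return ""
    # itertext() also collects text nested inside inline markup such as <i>.
    return " ".join("".join(child.itertext()).split())


def section_to_lines(section):
    """Flatten one section element into a list of plain-text lines."""
    lines = [f"Section {child_text(section, 'num')}. {child_text(section, 'heading')}"]
    for sub in section.findall("subsection"):
        lines.append(
            f"{child_text(sub, 'num')} {child_text(sub, 'heading')}: "
            f"{child_text(sub, 'content')}"
        )
    return lines


root = ET.fromstring(SAMPLE_XML)
english_lines = section_to_lines(root)
for line in english_lines:
    print(line)
```

A real source file would need its own namespace handling and deeper nesting (paragraphs, subparagraphs, notes), so the element lookups above would have to be adapted to the schema actually used.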
Student Research Computing Facilitator Profile They should be able to learn Python or already know how to code in Python or a similar language.
They should be able to learn to parse XML with Python.
They should also be able to learn to work with one of several Python NLP libraries such as NLTK ( https://realpython.com/nltk-nlp-python/ )
Mentee Research Computing Profile
Student Facilitator Programming Skill Level Some hands-on experience
Mentee Programming Skill Level
Project Institution University of Connecticut - Stamford
Project Address Stamford, Connecticut
Anchor Institution CR-Yale
Preferred Start Date
Start as soon as possible. Yes
Project Urgency 3 (scale from "Already behind" to "Start date is flexible")
Expected Project Duration (in months) 6
Launch Presentation
Launch Presentation Date 12/08/2021
Wrap Presentation
Wrap Presentation Date 06/08/2022
Project Milestones
  • Milestone Title: Capture tax code
    Milestone Description: Pull all legal tax code XML from https://www.irs.gov/…, https://uscode.house.gov or https://www.law.cornell.edu/uscode/text

  • Milestone Title: Transform tax code to English
    Milestone Description: Transform all of the IRS tax code XML into English legal tax code text.
  • Milestone Title: Design organizational mapping
    Milestone Description: Validate a useful organizational mapping so the English legal tax code text is stored in a relational database (MySQL). This will likely require NLP processing of the tax code to make it suitable for ErgoAI or Catala-lang. This is the first part of the threshold of acceptability.
  • Milestone Title: Apply organizational mapping
    Milestone Description: Apply the organizational mapping to all of the tax code.
  • Milestone Title: Store mapped tax code
    Milestone Description: Store the mapped legal tax code in a relational database (MySQL)
  • Milestone Title: Leveraging mapped tax code for deduction
    Milestone Description: Validate that the mapped legal tax code can be expressed as basic terms in ErgoAI or Catala-lang. This is the final part of the threshold of acceptability for the tax code translation.
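The mapping and storage milestones above could look roughly like the sketch below. It is not the project's actual schema: the table name provision, its columns, and the sample records are hypothetical, and Python's built-in sqlite3 module is used as a stand-in for MySQL so the example runs without a database server. The idea shown is one possible organizational mapping, where each provision keeps a pointer to its parent so the hierarchy of the code survives in relational form.

```python
import sqlite3

# Hypothetical records: (citation, parent citation, level, text).
# The text is placeholder wording, not actual statutory language.
RECORDS = [
    ("1", None, "section", "Example section heading"),
    ("1(a)", "1", "subsection", "Placeholder text for subsection (a)."),
    ("1(a)(1)", "1(a)", "paragraph", "Placeholder text for paragraph (1)."),
    ("1(b)", "1", "subsection", "Placeholder text for subsection (b)."),
]

# sqlite3 stands in for MySQL here; the SQL is kept deliberately generic.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE provision (
        citation        TEXT PRIMARY KEY,
        parent_citation TEXT REFERENCES provision(citation),
        level           TEXT NOT NULL,
        body            TEXT NOT NULL
    )
    """
)
conn.executemany("INSERT INTO provision VALUES (?, ?, ?, ?)", RECORDS)
conn.commit()


def children_of(citation):
    """Return the citations of the direct children of a provision."""
    rows = conn.execute(
        "SELECT citation FROM provision WHERE parent_citation = ? ORDER BY citation",
        (citation,),
    )
    return [row[0] for row in rows]


def ancestors_of(citation):
    """Walk parent pointers from a provision up to its top-level section."""
    query = "SELECT parent_citation FROM provision WHERE citation = ?"
    chain = []
    row = conn.execute(query, (citation,)).fetchone()
    while row is not None and row[0] is not None:
        chain.append(row[0])
        row = conn.execute(query, (row[0],)).fetchone()
    return chain


print(children_of("1"))
print(ancestors_of("1(a)(1)"))
```

Keeping a parent pointer per row makes it straightforward to retrieve a provision together with its enclosing context, which is the kind of lookup a later translation step into ErgoAI or Catala-lang would likely need.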
Github Contributions
Planned Portal Contributions (if any)
Planned Publications (if any)
What will the student learn? The student will learn how to parse XML, transform it (using XSLT), and store the transformed data.
The student will learn to parse the English tax code using a Python NLP library such as NLTK.
The student will learn to organize the legal text for storage so that retrieval and mapping to either ErgoAI or Catala-lang are easy.
The student will learn some data architecture.
This transformed/organized tax code will be stored in a relational database such as MySQL.
The student will learn SQL and how to interact with a relational database through a database workbench.
The student will learn how to work with a relational database from a language like Python.

If there is time, the student will learn about deduction in ErgoAI or Catala-lang.
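As a rough illustration of the text-parsing step, the sketch below splits a passage into sentences and flags words that often introduce conditions in legal text. It uses only the standard re module as a simplified stand-in for the tokenizers an NLP library such as NLTK provides. The function names, the list of condition words, and the sample passage are all assumptions made for this example, and the passage is placeholder wording rather than actual tax code.

```python
import re

# Hypothetical list of words that often signal a condition or exception.
CONDITION_WORDS = {"if", "unless", "except", "provided"}


def split_sentences(text):
    """Naively split after '.' or ';' followed by whitespace.

    Abbreviations such as 'Sec.' would be split incorrectly; a trained
    sentence tokenizer handles such cases better.
    """
    parts = re.split(r"(?<=[.;])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into words, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?|[^\sA-Za-z\d]", sentence)


def condition_markers(sentence):
    """Return the lower-cased condition words found in a sentence, in order."""
    return [tok.lower() for tok in tokenize(sentence) if tok.lower() in CONDITION_WORDS]


# Placeholder passage, not actual statutory language.
TEXT = (
    "A placeholder credit is allowed if the taxpayer files a return. "
    "No credit is allowed unless the form is signed."
)

sentences = split_sentences(TEXT)
markers = [condition_markers(sentence) for sentence in sentences]
for sentence, found in zip(sentences, markers):
    print(found, "->", sentence)
```

Flagging condition words is only a first pass; turning a sentence into a rule would also require identifying which clause is the condition and which is the consequence.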
What will the mentee learn?
What will the Cyberteam program learn from this project?
HPC resources needed to complete this project? No clear need for HPC.

However, the tax code is substantial, so the NLP application may require a good deal of CPU time.
Notes
What is the impact on the development of the principal discipline(s) of the project? This project had a solid impact on understanding automated knowledge authoring for legal reasoning. We explored a number of ways to simply transform legal text into logical reasoning in ErgoAI (a variation of Prolog).
What is the impact on other disciplines?
Is there an impact on physical resources that form infrastructure?
Is there an impact on the development of human resources for research computing? Yes - both positive impact on our student, Krutika Patel, as well as positive impact on managing student research.
Is there an impact on institutional resources that form infrastructure?
Is there an impact on information resources that form infrastructure?
Is there an impact on technology transfer?
Is there an impact on society beyond science and technology? Yes - there is an impact on technology transfer. Besides leadership by Phil Bradford, this project was done with a Connecticut entrepreneur (Henry Orphys) as well as a faculty member (Paul Fodor) from Stony Brook University. Henry has a distinguished law and tax accounting background, and he is focused on launching a startup using the technology we explored. Paul is both a faculty member and an entrepreneur. We isolated several challenges and now better understand the resources necessary for launching a product in this space.
Lessons Learned
Overall results