Merge pull request #5 from CompNet/dev

Merge Foppa 1.1.3
CompNet · Mar 28, 2024 · 665abc6
2 parents 2d3a095 + 5485ad9
commit 665abc6

Showing 2 changed files with 28 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -1,20 +1,20 @@
-FoppaInit v1.0.2
+FoppaInit v1.0.3
 -------------------------------------------------------------------------
 *Initialization of the FOPPA database*
 
-* Copyright 2021-2023 Lucas Potin & Vincent Labatut
+* Copyright 2021-2024 Lucas Potin & Vincent Labatut
 
 FoppaInit is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see `licence.txt`
 
 * **Lab site:** http://lia.univ-avignon.fr
 * **GitHub repo:** https://github.com/CompNet/FoppaInit
-* **Data:** https://doi.org/10.5281/zenodo.7808664
+* **Data:** https://doi.org/10.5281/zenodo.10879932
 * **Contact:** Lucas Potin <lucas.potin@univ-avignon.fr>
 
 -------------------------------------------------------------------------
 
 # Description
-These scripts create the FOPPA database v.1.1.2 from raw TED files. This database relies mainly on the award notices of public contracts related to French clients and suppliers from 2010 to 2020 in the Tenders Electronic Daily. It also proposes an enrichment of these data, thanks to the siretization of agents (i.e. the retrieval of their unique IDs, which is missing for most of them) as well as the cleaning and extraction of award criteria, and other processing.
+These scripts create the FOPPA database v.1.1.3 from raw TED files. This database relies mainly on the award notices of public contracts related to French clients and suppliers from 2010 to 2020 in the Tenders Electronic Daily. It also proposes an enrichment of these data, thanks to the siretization of agents (i.e. the retrieval of their unique IDs, which is missing for most of them) as well as the cleaning and extraction of award criteria, and other processing.
 
 The process conducted to build the FOPPA is quite long, though (around 1 week, depending on the hardware), so the produced database is alternatively directly available on [Zenodo](https://doi.org/10.5281/zenodo.7808664). The details of this processing are described in an article [[P'23]](#references), and in further detail in a technical report [[P'22]](#references).
 
@@ -64,9 +64,10 @@ Tested with Python version 3.8.0, with the following packages:
 * [`dedupe`](https://pypi.org/project/dedupe/): version 2.0.19.
 
 # Data
-The produced database is directly available publicly online on [Zenodo](https://doi.org/10.5281/zenodo.7443842), under three different forms:
+The produced database is directly available publicly online on [Zenodo](https://doi.org/10.5281/zenodo.10879932), under four different forms:
 * SQLite file: https://www.sqlite.org/index.html
-* SQL dump.
+* SQLite dump.
+* MySQL dump.
 * CSV files (one by table).
 
 # References

diff --git a/foppaInit.py b/foppaInit.py
@@ -219,7 +219,7 @@ def databaseCreation(nameDatabase):
     cursor = database.cursor()
     request = "DROP TABLE IF EXISTS Lots"
     sql = cursor.execute(request)
-    request = "CREATE TABLE Lots(lotId INTEGER,tedCanId INTEGER,correctionsNB INTEGER,cancelled INTEGER,awardDate TEXT,awardEstimatedPrice NUMERIC,awardPrice NUMERIC,cpv TEXT,tenderNumber INTEGER,onBehalf TINYINT,jointProcurement TINYINT,fraAgreement TINYINT,fraEstimated INTEGER,lotsNumber INTEGER,accelerated TINYINT,outOfDirectives TINYINT,contractorSme TINYINT,numberTendersSme INTEGER,subContracted TINYINT,gpa	TINYINT,multipleCae	TINYINT,typeOfContract TEXT,topType	TEXT,renewal TINYINT, contractDuration INTEGER, publicityDuration INTEGER,PRIMARY KEY(lotId))"
+    request = "CREATE TABLE Lots(lotId INTEGER,tedCanId INTEGER,correctionsNB INTEGER,cancelled INTEGER,awardDate TEXT,awardEstimatedPrice NUMERIC,awardPrice NUMERIC,cpv TEXT,tenderNumber INTEGER,onBehalf TINYINT,jointProcurement TINYINT,fraAgreement TINYINT,fraEstimated TEXT,lotsNumber INTEGER,accelerated TINYINT,outOfDirectives TINYINT,contractorSme TINYINT,numberTendersSme INTEGER,subContracted TINYINT,gpa	TINYINT,multipleCae	TINYINT,typeOfContract TEXT,topType	TEXT,renewal TINYINT, contractDuration NUMERIC, publicityDuration NUMERIC,PRIMARY KEY(lotId))"
     sql = cursor.execute(request)
     request = "DROP TABLE IF EXISTS AgentsBase"
     sql = cursor.execute(request)
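
Note on the column type changes in the hunk above: SQLite treats declared column types as affinities rather than strict constraints, so the effect of declaring `fraEstimated` as TEXT and the two duration columns as NUMERIC can be checked with `typeof()`. The following is a minimal sketch, assuming an in-memory database and a hypothetical three-column table `Demo` that is not part of FoppaInit. It only illustrates SQLite's affinity rules; the diff does not state the motivation for the change.

```python
import sqlite3


def storage_classes(value):
    """Return the storage class SQLite picks for `value` under
    INTEGER, NUMERIC and TEXT column affinity (in that order)."""
    connection = sqlite3.connect(":memory:")
    try:
        cursor = connection.cursor()
        # Hypothetical table used for illustration only.
        cursor.execute(
            "CREATE TABLE Demo(asInteger INTEGER, asNumeric NUMERIC, asText TEXT)"
        )
        # The same value is bound to all three columns.
        cursor.execute("INSERT INTO Demo VALUES (?, ?, ?)", (value, value, value))
        cursor.execute(
            "SELECT typeof(asInteger), typeof(asNumeric), typeof(asText) FROM Demo"
        )
        return cursor.fetchone()
    finally:
        connection.close()


if __name__ == "__main__":
    for sample in ("12", "2.5", "6 months"):
        print(repr(sample), storage_classes(sample))
```

In SQLite, a column with INTEGER affinity stores values the same way as one with NUMERIC affinity (the two differ only in `CAST` expressions), whereas a TEXT column keeps numeric-looking input as text. The declared type may matter more to tools that read the schema literally, for instance when the database is exported to another engine.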
@@ -275,6 +275,26 @@ def firstCleaning(datas,database):
 
     # Parenthesis deletion
     datas["CAE_NAME"] = datas["CAE_NAME"].replace(regex=r'\([^)]*\)',value=r"")
+
+    # Replace "Y" by 1 and "N" by 0 on boolean columns
+    datas["B_ON_BEHALF"] = datas["B_ON_BEHALF"].replace(regex=r'Y',value=r"1")
+    datas["B_ON_BEHALF"] = datas["B_ON_BEHALF"].replace(regex=r'N',value=r"0")
+    datas["B_INVOLVES_JOINT_PROCUREMENT"] = datas["B_INVOLVES_JOINT_PROCUREMENT"].replace(regex=r'Y',value=r"1")
+    datas["B_INVOLVES_JOINT_PROCUREMENT"] = datas["B_INVOLVES_JOINT_PROCUREMENT"].replace(regex=r'N',value=r"0")
+    datas["B_FRA_AGREEMENT"] = datas["B_FRA_AGREEMENT"].replace(regex=r'Y',value=r"1")
+    datas["B_FRA_AGREEMENT"] = datas["B_FRA_AGREEMENT"].replace(regex=r'N',value=r"0")
+    datas["B_ACCELERATED"] = datas["B_ACCELERATED"].replace(regex=r'Y',value=r"1")
+    datas["B_ACCELERATED"] = datas["B_ACCELERATED"].replace(regex=r'N',value=r"0")
+    datas["B_OUT_OF_DIRECTIVES"] = datas["B_OUT_OF_DIRECTIVES"].replace(regex=r'Y',value=r"1")
+    datas["B_OUT_OF_DIRECTIVES"] = datas["B_OUT_OF_DIRECTIVES"].replace(regex=r'N',value=r"0")
+    datas["B_CONTRACTOR_SME"] = datas["B_CONTRACTOR_SME"].replace(regex=r'Y',value=r"1")
+    datas["B_CONTRACTOR_SME"] = datas["B_CONTRACTOR_SME"].replace(regex=r'N',value=r"0")
+    datas["B_SUBCONTRACTED"] = datas["B_SUBCONTRACTED"].replace(regex=r'Y',value=r"1")
+    datas["B_SUBCONTRACTED"] = datas["B_SUBCONTRACTED"].replace(regex=r'N',value=r"0")
+    datas["B_GPA"] = datas["B_GPA"].replace(regex=r'Y',value=r"1")
+    datas["B_GPA"] = datas["B_GPA"].replace(regex=r'N',value=r"0")
+    datas["B_MULTIPLE_CAE"] = datas["B_MULTIPLE_CAE"].replace(regex=r'Y',value=r"1")
+    datas["B_MULTIPLE_CAE"] = datas["B_MULTIPLE_CAE"].replace(regex=r'N',value=r"0")
 
     nameCAE = np.array(datas["CAE_NAME"])
     siretCAE = np.array(datas["CAE_NATIONALID"])
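
Review note on the "Y"/"N" replacement added in the hunk above: the eighteen near-identical lines could be expressed as a loop over the column names. The following is a minimal sketch, assuming each of these columns contains only `"Y"`, `"N"` or missing values; the helper name `encode_boolean_columns` and the constant `BOOLEAN_COLUMNS` are hypothetical and do not appear in foppaInit.py. It uses a dictionary with `Series.replace`, which replaces whole cell values, whereas the committed code uses `regex=r'Y'` and `regex=r'N'`, which replace every occurrence of those letters inside a value. The two agree only under the stated assumption.

```python
import pandas as pd

# Column names taken from the diff above.
BOOLEAN_COLUMNS = [
    "B_ON_BEHALF",
    "B_INVOLVES_JOINT_PROCUREMENT",
    "B_FRA_AGREEMENT",
    "B_ACCELERATED",
    "B_OUT_OF_DIRECTIVES",
    "B_CONTRACTOR_SME",
    "B_SUBCONTRACTED",
    "B_GPA",
    "B_MULTIPLE_CAE",
]


def encode_boolean_columns(datas, columns=BOOLEAN_COLUMNS):
    """Replace whole-cell "Y" by "1" and "N" by "0" in the given columns.

    The DataFrame is modified in place and also returned for convenience.
    """
    mapping = {"Y": "1", "N": "0"}
    for column in columns:
        datas[column] = datas[column].replace(mapping)
    return datas


if __name__ == "__main__":
    sample = pd.DataFrame({name: ["Y", "N"] for name in BOOLEAN_COLUMNS})
    print(encode_boolean_columns(sample))
```

Inside `firstCleaning`, the call would be `datas = encode_boolean_columns(datas)`. Keeping the column list in one place also makes it easier to see which flags are encoded.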